Flux-core: flux PMI will not init spectrum MPI

Created on 23 Mar 2018 · 54 Comments · Source: flux-framework/flux-core

See below:

scogland at sierra4358 in ~/expariments  (FLUX:local:///var/tmp/flux-PWuPz2)
$ flux wreckrun -n 4 ./a.out
2018-03-23T16:35:24.298395Z sched.err[0]: job 3 bad state transition from reserved to starting
2018-03-23T16:35:24.298414Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.298431Z sched.err[0]: job_state_cb: failed to invoke callbacks
2018-03-23T16:35:24.360057Z sched.err[0]: job 3 bad state transition from reserved to running
2018-03-23T16:35:24.360075Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.360092Z sched.err[0]: job_state_cb: failed to invoke callbacks
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79265] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
2018-03-23T16:35:24.441870Z sched.err[0]: job 3 bad state transition from reserved to complete
2018-03-23T16:35:24.441888Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.441904Z sched.err[0]: job_state_cb: failed to invoke callbacks
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79266] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79267] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79264] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
wreckrun: tasks [0-3]: exited with exit code 1

Not sure if this will help but running with -o trace-pmi-server might give more information about what the flux PMI server is seeing (if anything).

I think what’s going on is that spectrum is using an older OpenMPI
than the one that we got flux support into. Not 100% sure, but I get
the impression it just doesn’t even try to talk to us right now. =(

On 23 Mar 2018, at 9:40, Mark Grondona wrote:

> Not sure if this will help but running with -o trace-pmi-server
> might give more information about what the flux PMI server is seeing
> (if anything).

--
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub:
https://github.com/flux-framework/flux-core/issues/1382#issuecomment-375726780

2018-03-23T16:35:24.298395Z sched.err[0]: job 3 bad state transition from reserved to starting
2018-03-23T16:35:24.298414Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.298431Z sched.err[0]: job_state_cb: failed to invoke callbacks
2018-03-23T16:35:24.360057Z sched.err[0]: job 3 bad state transition from reserved to running
2018-03-23T16:35:24.360075Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.360092Z sched.err[0]: job_state_cb: failed to invoke callbacks

Just a side note: the state-transition handling needs to be beefed up for abnormal transitions like this.

@trws: if you have specific questions, I can talk to the Spectrum MPI guys at IBM. I guess the main question is:

Does Spectrum MPI use PMI (or PMIX)? And do they have a recipe to make Spectrum MPI talk to another bootstrapper like flux that implements normal PMI?

It uses PMIX, which should work with us if they’re using a recent
enough version IIRC. We’ll have to figure out exactly what to ask,
but it may be as simple as asking them to compile in the flux support
module for their MPI’s internal PMI implementation.

On 23 Mar 2018, at 10:06, Dong H. Ahn wrote:

> @trws: if you have specific questions, I can talk to the Spectrum MPI guys
> at IBM. I guess the main question is:
>
> Does Spectrum MPI use PMI (or PMIX)? And do they have a recipe to make
> Spectrum MPI talk to another bootstrapper like flux that implements
> normal PMI?


Also in theory, we should be able to use OpenMPI. We can use a version of OMPI for which we tested flux's support. I know we installed those on EA systems but I'm not sure if we have on Sierra systems. Let me ask.

Talked to Adam Moody; will start a separate email discussion thread.

> It uses PMIX, which should work with us if they’re using a recent
> enough version IIRC.

What does work is that Flux can be launched with PMIX's backwards-compatibility support for PMI-1.

If the MPI wants only PMIX, then Flux can't launch it. I think what we added to OMPI was support for the PMI-1 wire protocol, which we offer and which can be used without having to relink MPI against our PMI library. If this is a variant of OMPI, maybe that could be backported?

ompi_info output might be helpful, e.g.

$ /opt/openmpi/2.x-dev/bin/ompi_info|grep pmi
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA pmix: pmix2x (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v3.0.0)
$ /opt/openmpi/2.x-dev/bin/ompi_info|grep flux
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v3.0.0)

See #923 for more detailed info on OMPI's "flux support". I made pretty detailed commit comments in OMPI when this was added, in case anyone needs to dig into this.

Uh, looks like Ralph squashed my whole PR down to one commit in the merge: https://github.com/open-mpi/ompi/commit/215d6290e00f5306fb31610325bea99e40b30a6b and concatenated all my comments, so it reads a bit like a run-on.

sierra4359{dahn}36: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2018.02.05/bin/ompi_info | grep pmi
                MCA pmix: ext2x (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v10.2.0)
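
To compare builds like the two above, the `ompi_info | grep pmi` output can be turned into a per-framework component list. This is only a minimal sketch: `mca_components` is a hypothetical helper name, and the sample text is the Spectrum MPI output pasted above.

```python
import re

# Matches lines such as
#   "MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v10.2.0)"
MCA_LINE = re.compile(r"^\s*MCA\s+(?P<framework>\w+):\s+(?P<component>\w+)\s+\(")


def mca_components(ompi_info_text):
    """Return {framework: [component, ...]} parsed from ompi_info output."""
    found = {}
    for line in ompi_info_text.splitlines():
        m = MCA_LINE.match(line)
        if m:
            found.setdefault(m.group("framework"), []).append(m.group("component"))
    return found


# Sample input copied from the Spectrum MPI output in this thread.
SPECTRUM = """\
                MCA pmix: ext2x (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v10.2.0)
"""

components = mca_components(SPECTRUM)
print("pmix components:", components.get("pmix", []))
print("flux pmix component present:", "flux" in components.get("pmix", []))
```

Note that this only shows the flux component is installed in the build; it says nothing about whether it gets selected at run time.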

Maybe there is a runtime MCA option to turn on flux support... If not, building OpenMPI ourselves would be the path of least resistance. I don't believe we build Spectrum MPI on our own.

FYI -- From Roy Mussleman:

Hi Dong,

No, I have not ported OpenMPI to sierra.

The CORAL EA Power8 build process is on rzmanta at /usr/tcetmp/packages/openmpi/openmpi-2.0.2-gcc-4.8.5/src

Looks like Chris Earl was playing with openmpi-2.0.0-clang-3.9.1

Bob Walkup gave me this process ( patches and configure ) over a year ago before he had spectrum_mpi to work with.
He recently indicated there is a problem on sierra with compatibility between OpenMPI and jsrun.
"Spectrum MPI does not expose the pmix software that would be required to make openmpi work with the spectrum mpi jsrun"
So you may need to go the mpirun path in an interactive session. Maybe similar to what Adam provided for mvapich.

For Power9, you'll need to change the reference to power8 in the build.sh script.
It uses the mellanox collectives (mxm)

Would spack have a semi-official version?

Not sure if this is pertinent, but we did run into this problem with openmpi built on TOSS 3:

TOSS-3153 openmpi should not set rpath for libpmi.so

Here's a new fun detail, setting FLUX_JOB_SIZE and FLUX_JOB_NNODES kills the spectrum mpi mpirun... ><

Looks like if FLUX_JOB_ID is set, it tries to do something that causes it to segfault.

heh: >< - good tomoticon!

@trws: What happens if you select flux component for pmix type?

mpirun -n 4 --mca pmix flux hello_world

I know ultimately you want to use flux wreckrun, but if pmix=flux actually activates flux support within Spectrum MPI, we can simply pass this MCA key-value pair... Maybe we are already doing this though...

I'll try that, it would be a less nasty solution.

@trws: I know some of us will be busy with SC18 submissions and spring break for the next two weeks, so I think it will be good to summarize the issues we need to unblock and the "good to haves" for the splash effort.

I think I may be able to fit in emitting trimmed R for affinity and optimizing rdesc fetching using @grondo's experimental wreck. Anything else?

That and the combined cancellation/kill are the only things that would
be really good for splash right now. The rest is more documentation of
issues to address “at some point” and things that will be blockers
for ATS.

On 25 Mar 2018, at 10:47, Dong H. Ahn wrote:

> @trws: I know some of us will be busy with SC18 submissions and spring
> break for the next two weeks, so I think it will be good to summarize the
> issues we need to unblock and the "good to haves" for the splash effort.
>
> I think I may be able to fit in emitting trimmed R for affinity and
> optimizing rdesc fetching using @grondo's experimental wreck.
> Anything else?


Some notes on how ompi under flux is supposed to work, and a quick test on my desktop to ensure we haven't regressed anything.

First ensure that a hello world mpi program can be compiled with ompi and run under flux (yup):

$ /opt/openmpi/2.x-dev/bin/ompi_info --version
Open MPI v3.0.0a1
$ /opt/openmpi/2.x-dev/bin/mpicc -o hello.ompi hello.c
$ flux wreckrun -n 2 ./hello.ompi
0: completed MPI_Init in 0.058s.  There are 2 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.007s

Exercise PMI client side debug (prove that ompi flux support opened flux PMI library)

$ FLUX_PMI_DEBUG=1 flux wreckrun  -n 2 ./hello.ompi
PMI_Init: PMI_FD is set, selecting simple_client
PMI_Init: PMI_FD is set, selecting simple_client
1: PMI_Init rc=0 
1: PMI_KVS_Get_value_length_max rc=0 
...
1: PMI_Barrier rc=0 
1: PMI_Finalize rc=0 
0: PMI_Barrier rc=0 
0: PMI_Finalize rc=0 
0: completed MPI_Finalize in 0.014s

Exercise PMI server side debug (prove that flux PMI library connected to PMI_FD provided by wrexecd):

$ flux wreckrun -o trace_pmi_server -n2 ./hello.ompi
1: C: cmd=init pmi_version=1 pmi_subversion=1
1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
1: C: cmd=get_maxes
1: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
...
1: S: cmd=finalize_ack rc=0
0: C: cmd=finalize
0: S: cmd=finalize_ack rc=0
0: completed MPI_Finalize in 0.009s
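
When scanning a longer trace, it can help to split each line into rank, side, and key=value fields. A minimal sketch follows, assuming `C:` marks a client request and `S:` a server response (my reading of the trace above) and that values contain no spaces; `parse_trace_line` is a hypothetical helper name.

```python
import re

# Trace lines in the output above look like "<rank>: <C|S>: key=value key=value ...".
TRACE_LINE = re.compile(r"^(?P<rank>\d+): (?P<side>[CS]): (?P<body>.*)$")


def parse_trace_line(line):
    """Parse one trace line; return None for anything else (e.g. program output)."""
    m = TRACE_LINE.match(line.strip())
    if m is None:
        return None
    # Assumption: fields are space-separated and values contain no spaces.
    fields = dict(
        pair.split("=", 1) for pair in m.group("body").split() if "=" in pair
    )
    return {"rank": int(m.group("rank")), "side": m.group("side"), "fields": fields}


# Sample lines copied from the trace in this thread.
sample = [
    "1: C: cmd=init pmi_version=1 pmi_subversion=1",
    "1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1",
    "0: completed MPI_Finalize in 0.009s",
]
for line in sample:
    print(parse_trace_line(line))
```

The last sample line is ordinary program output, so the helper returns None for it rather than guessing.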

The two ompi flux modules are: mca_pmix_flux.so and mca_schizo_flux.so.

mca_pmix_flux.so

mca_pmix_flux.so requires the following environment variables to be set.
Both are set for all tasks launched through wreckrun:

FLUX_JOB_ID

The module dlopens the flux pmi library using the above environment variable, then translates ompi generic pmi-ish calls to PMI-1 API calls supplied by our PMI library.

lib/flux/libpmi.so

The flux PMI library tries the following, in descending priority:

  1. if PMI_FD environment var is set, talk PMI-1 wire protocol to wrexecd on it (set by wrexecd - this is what we want!)
  2. if PMIX_SERVER_URI is set, dlopen libpmix.so and redirect PMI calls to PMI-1 API there
  3. if PMI_LIBRARY is set, dlopen that library and redirect PMI calls to PMI-1 API there

Regardless of what the flux PMI library chooses to do here, FLUX_PMI_DEBUG should tell you that the flux PMI library was called, and if it is dlopening another PMI library, what it passed to dlopen.
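
For reference, the priority order listed above can be sketched as follows. The function name and return values are illustrative only (`simple_client` is borrowed from the FLUX_PMI_DEBUG output earlier in this thread); the real logic lives in the flux PMI library, and this sketch does not model what happens when none of the variables is set.

```python
def select_pmi_backend(env):
    """Return (mode, detail) following the three-step priority order above.

    `env` is a dict of environment variables. Assumption: "set" means the
    key is present, regardless of its value.
    """
    if "PMI_FD" in env:
        # 1. Talk the PMI-1 wire protocol on the inherited file descriptor.
        return ("simple_client", env["PMI_FD"])
    if "PMIX_SERVER_URI" in env:
        # 2. dlopen libpmix.so and redirect PMI-1 API calls there.
        return ("dlopen", "libpmix.so")
    if "PMI_LIBRARY" in env:
        # 3. dlopen the named library and redirect PMI-1 API calls there.
        return ("dlopen", env["PMI_LIBRARY"])
    # The thread does not describe a fallback, so none is modelled here.
    return (None, None)


# PMI_FD wins even when a lower-priority variable is also set.
# The library path below is a made-up placeholder.
print(select_pmi_backend({"PMI_FD": "5", "PMI_LIBRARY": "/hypothetical/libpmi.so"}))
print(select_pmi_backend({"PMIX_SERVER_URI": "placeholder"}))
```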

mca_schizo_flux.so

I can't make heads or tails of the mca_schizo_flux.so component. It seems to be all boilerplate (provided by Ralph I think, and since he squashed all my and his commits together it's impossible to tell if I'm just forgetting) and is part of orte.

I seem to recall it needs to be there to ensure the other module runs, but no idea how.

@trws: I said I would find the runes for getting an ompi-linked MPI program to emit some debug.
This looks like it might be promising:

$ OMPI_MCA_pmix_base_verbose=255 flux wreckrun -n2 ./hello.ompi
[jimbo:05553] mca: base: components_register: registering framework pmix components
[jimbo:05553] mca: base: components_register: found loaded component isolated
[jimbo:05553] mca: base: components_register: component isolated has no register or open function
[jimbo:05553] mca: base: components_register: found loaded component pmix2x
[jimbo:05553] mca: base: components_register: component pmix2x register function successful
[jimbo:05553] mca: base: components_register: found loaded component flux
[jimbo:05553] mca: base: components_register: component flux register function successful
[jimbo:05553] mca: base: components_open: opening pmix components
[jimbo:05553] mca: base: components_open: found loaded component isolated
[jimbo:05553] mca: base: components_open: component isolated open function successful
[jimbo:05553] mca: base: components_open: found loaded component pmix2x
[jimbo:05553] mca: base: components_open: component pmix2x open function successful
[jimbo:05553] mca: base: components_open: found loaded component flux
[jimbo:05553] mca:base:select: Auto-selecting pmix components
[jimbo:05553] mca:base:select:( pmix) Querying component [isolated]
[jimbo:05553] mca:base:select:( pmix) Query of component [isolated] set priority to 0
[jimbo:05553] mca:base:select:( pmix) Querying component [pmix2x]
[jimbo:05553] mca:base:select:( pmix) Query of component [pmix2x] set priority to 5
[jimbo:05553] mca:base:select:( pmix) Querying component [flux]
[jimbo:05553] mca:base:select:( pmix) Query of component [flux] set priority to 20
[jimbo:05553] mca:base:select:( pmix) Selected component [flux]
[jimbo:05553] mca: base: close: component isolated closed
[jimbo:05553] mca: base: close: unloading component isolated
[jimbo:05553] mca: base: close: component pmix2x closed
[jimbo:05553] mca: base: close: unloading component pmix2x
[jimbo:05553] [[0,45],1] pmix:flux: assigned tmp name
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.lrank
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.lrank
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.nrank
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.nrank
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.max.size
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.max.size
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.job.size
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.job.size
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.appnum
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.appnum
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.local.size
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.local.size
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.num.nodes
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.tmpdir
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.nsdir
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.pdir
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.tdir.rmclean
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.ltopo
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:flux put for key pmix.cpuset
[jimbo:05553] [[0,45],1] pmix:flux put for key opal.puri
[jimbo:05553] [[0,45],1] pmix:flux put for key pmix.hname
[jimbo:05553] [[0,45],1] pmix:flux put for key MPI_THREAD_LEVEL
[jimbo:05553] [[0,45],1] pmix:flux put for key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.loc
# snip - output for other rank
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.loc
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.loc
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.loc
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.hname
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.hname
[jimbo:05553] [[0,45],1] pmix:flux called get for key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:flux got key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:flux called get for key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:flux got key btl.tcp.3.0
[jimbo:05553] mca: base: close: unloading component flux
0: completed MPI_Init in 0.099s.  There are 2 tasks
0: completed first barrier in 0.024s
0: completed MPI_Finalize in 0.022s

A new bit of info here, it's also true the other way around. The spectrum MPI mpirun, orterun and jsrun will not bootstrap flux either. I'm not sure if this is an issue with something that changed in a newer PMIX or something particular to IBM's implementation of PMIX, but it may be worth looking into at some point, or at least a good reason to write something that can launch flux under LSF that doesn't require mpich...

When we do look at this, a good thing to try would be to set FLUX_PMI_DEBUG=1 in the environment and try to run a small job with the native launch tool(s). This will cause trace information from our PMI client (used by the broker) to go to stderr. (See example in earlier comment)
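
If it helps to script that, here is a minimal sketch of a wrapper that sets FLUX_PMI_DEBUG=1 for a child command and captures its stderr. `run_with_pmi_debug` is a hypothetical helper, and the child used below is only a stand-in that echoes the variable; the real launcher command line is left out because I don't want to guess at its flags.

```python
import os
import subprocess
import sys


def run_with_pmi_debug(cmd, base_env=None):
    """Run cmd with FLUX_PMI_DEBUG=1 added to its environment.

    Returns (exit_code, stderr_text); stderr is where the PMI client
    trace is expected to appear.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FLUX_PMI_DEBUG"] = "1"
    result = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stderr


# Stand-in child process: writes the variable to stderr so we can see it was set.
stand_in = [
    sys.executable,
    "-c",
    "import os, sys; sys.stderr.write('FLUX_PMI_DEBUG=' + os.environ['FLUX_PMI_DEBUG'])",
]
code, err = run_with_pmi_debug(stand_in)
print(code, err)
```

Replacing `stand_in` with the native launcher invocation would collect the broker's PMI client trace in `err`.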

So can we launch flux as a batch job directly under LSF, or are there no native options for launching flux on that machine?

I'm not sure what blaunch would do on manta/ray, but on sierra jsrun is the official LSF-sanctioned launcher, so there is currently no native option for launching flux there. I would say that I think we could launch it with a config file with the native tools, but no PMI wireup can be expected at the moment.

Just requested butte/sierra access and will try to debug issues with Flux launching spectrum MPI apps directly.

I'm on, and flux-core master builds fine (--disable-jobspec needed).

I did hit these make check failures

in t2000-wreck.t:

expecting success: 
    run_timeout 15 flux wreckrun -v -n$(($(nproc)*${SIZE}+1)) /bin/true

wreckrun: 0.011s: Registered jobid 21
wreckrun: 0.012s: State = reserved
wreckrun: 0.013s: job.submit: Function not implemented
wreckrun: Allocating 513 tasks across 4 available nodes..
wreckrun: tasks per node: node[0-2]: 129, node3: 126
wreckrun: 0.019s: Sending run event
2018-04-10T00:16:54.760719Z connector-local.err[2]: send kvs.lookup response to client 605C5: Broken pipe
2018-04-10T00:16:54.760325Z connector-local.err[1]: send kvs.lookup response to client E93A5: Broken pipe
wreckrun: Killed by SIGALRM: state = reserved
not ok 20 - wreckrun: oversubscription of tasks

In t0001-basic.t:

expecting success: 
    size=$(test_size_large)  &&
    test -n "$size" &&
    size=$(FLUX_TEST_SIZE_MAX=2 test_size_large) &&
    test "$size" = "2" &&
    size=$(FLUX_TEST_SIZE_MIN=123 FLUX_TEST_SIZE_MAX=1000 test_size_large) &&
    test "$size" = "123"

not ok 49 - builtin test_size_large () works
#   
#       size=$(test_size_large)  &&
#       test -n "$size" &&
#       size=$(FLUX_TEST_SIZE_MAX=2 test_size_large) &&
#       test "$size" = "2" &&
#       size=$(FLUX_TEST_SIZE_MIN=123 FLUX_TEST_SIZE_MAX=1000 test_size_large) &&
#       test "$size" = "123"
#   

Heading out, just wanted to document where I was in this investigation.

Here's an initial baseline, building a test executable and running it under spectrum mpirun and wreckrun with debug.

Build t/mpi/hello.ibm test program with:

CC=/opt/ibm/spectrum_mpi/bin/mpicc

all: hello.ibm

# Rebuild when hello.c changes; the recipe line must begin with a tab.
hello.ibm: hello.c
	$(CC) -o $@ hello.c

Ldd output looks like this

$ ldd hello.ibm
    linux-vdso64.so.1 =>  (0x0000100000000000)
    libmpiprofilesupport.so.3 => /opt/ibm/spectrum_mpi/lib/libmpiprofilesupport.so.3 (0x0000100000060000)
    libmpi_ibm.so.3 => /opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3 (0x0000100000080000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00001000001d0000)
    libc.so.6 => /lib64/libc.so.6 (0x0000100000210000)
    librt.so.1 => /lib64/librt.so.1 (0x0000100000400000)
    libutil.so.1 => /lib64/libutil.so.1 (0x0000100000430000)
    libhwloc.so.5 => /opt/ibm/spectrum_mpi/lib/libhwloc.so.5 (0x0000100000460000)
    libm.so.6 => /lib64/libm.so.6 (0x00001000004b0000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00001000005a0000)
    libevent-2.0.so.5 => /opt/ibm/spectrum_mpi/lib/libevent-2.0.so.5 (0x00001000005d0000)
    libevent_pthreads-2.0.so.5 => /opt/ibm/spectrum_mpi/lib/libevent_pthreads-2.0.so.5 (0x0000100000630000)
    libopen-rte.so.3 => /opt/ibm/spectrum_mpi/lib/libopen-rte.so.3 (0x0000100000650000)
    libopen-pal.so.3 => /opt/ibm/spectrum_mpi/lib/libopen-pal.so.3 (0x0000100000740000)
    /lib64/ld64.so.2 (0x0000000020000000)

Native run:

$ /opt/ibm/spectrum_mpi/bin/mpirun -n 2 ./hello.ibm
0: completed MPI_Init in 0.147s.  There are 2 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.052s

Native run with debug

OMPI_MCA_pmix_base_verbose=255 /opt/ibm/spectrum_mpi/bin/mpirun -n 2 ./hello.ibm
[butte5:123607] mca: base: components_register: registering framework pmix components
[butte5:123607] mca: base: components_register: found loaded component ext2x
[butte5:123607] mca: base: components_register: component ext2x has no register or open function
[butte5:123607] mca: base: components_register: found loaded component flux
[butte5:123607] mca: base: components_register: component flux register function successful
[butte5:123607] mca: base: components_open: opening pmix components
[butte5:123607] mca: base: components_open: found loaded component ext2x
[butte5:123607] mca: base: components_open: component ext2x open function successful
[butte5:123607] mca: base: components_open: found loaded component flux
[butte5:123607] mca:base:select: Auto-selecting pmix components
[butte5:123607] mca:base:select:( pmix) Querying component [ext2x]
[butte5:123607] mca:base:select:( pmix) Query of component [ext2x] set priority to 5
[butte5:123607] mca:base:select:( pmix) Querying component [flux]
[butte5:123607] mca:base:select:( pmix) Selected component [ext2x]
[butte5:123607] mca: base: close: unloading component flux
[butte5:123607] posting notification recv on tag 0
[butte5:123607] pmix:server init called
[snip]
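
One thing that stands out when comparing this with the desktop Open MPI log earlier in the thread: there the flux component reports priority 20 and is selected, while here no priority line is logged for flux and ext2x is selected. A minimal sketch for pulling that summary out of a verbose log follows; `summarize_selection` is a hypothetical helper, and it assumes output from a single process (concatenated logs from several ranks would repeat entries).

```python
import re

QUERIED = re.compile(r"Querying component \[(\w+)\]")
PRIORITY = re.compile(r"Query of component \[(\w+)\] set priority to (\d+)")
SELECTED = re.compile(r"Selected component \[(\w+)\]")


def summarize_selection(log_text):
    """Summarize which pmix components were queried, prioritized, and selected."""
    queried = QUERIED.findall(log_text)
    priorities = {name: int(prio) for name, prio in PRIORITY.findall(log_text)}
    selected = SELECTED.search(log_text)
    return {
        "queried": queried,
        "priorities": priorities,
        # Components that were queried but never logged a priority.
        "no_priority": [name for name in queried if name not in priorities],
        "selected": selected.group(1) if selected else None,
    }


# Sample lines copied from the native mpirun log above.
NATIVE = """\
[butte5:123607] mca:base:select:( pmix) Querying component [ext2x]
[butte5:123607] mca:base:select:( pmix) Query of component [ext2x] set priority to 5
[butte5:123607] mca:base:select:( pmix) Querying component [flux]
[butte5:123607] mca:base:select:( pmix) Selected component [ext2x]
"""

print(summarize_selection(NATIVE))
```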

Now run under flux with debug. Hmm, doesn't seem to get far enough to even emit debug.

OMPI_MCA_pmix_base_verbose=255 flux wreckrun -n2 ./hello.ibm
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
--------------------------------------------------------------------------
But I couldn't open the help file:
Sorry!  You were supposed to get help about:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
    orte_init:startup:internal-failure
--------------------------------------------------------------------------
But I couldn't open the help file:
--------------------------------------------------------------------------
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
Sorry!  You were supposed to get help about:
--------------------------------------------------------------------------
    mpi_init:startup:internal-failure
--------------------------------------------------------------------------
But I couldn't open the help file:
Sorry!  You were supposed to get help about:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
    mpi_init:startup:internal-failure
--------------------------------------------------------------------------
But I couldn't open the help file:
*** An error occurred in MPI_Init
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
*** on a NULL communicator
--------------------------------------------------------------------------
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** An error occurred in MPI_Init
***    and potentially your MPI job)
*** on a NULL communicator
[butte5:126191] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[butte5:126190] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
wreckrun: tasks [0-1]: exited with exit code 1

You may get some earlier messages by adding other verbose flags. I think mca_base_verbose comes up first and manages the choice of other modules, but there are a ton of them and I haven't tried them all.
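
Rather than typing these one at a time, the settings for several frameworks could be generated in one go. This is only a sketch: it assumes the usual Open MPI convention that a framework named X reads OMPI_MCA_X_base_verbose, the helper name verbose_env is made up, and Spectrum MPI may not honor every framework listed.

```shell
# Print OMPI_MCA_<framework>_base_verbose=<level> for each framework named.
# Assumption: every framework follows the <framework>_base_verbose convention.
verbose_env() {
    level=$1
    shift
    for fw in "$@"; do
        printf 'OMPI_MCA_%s_base_verbose=%s\n' "$fw" "$level"
    done
}

# Possible usage (framework names are illustrative):
#   env $(verbose_env 100 mca pmix ess) flux wreckrun -n 1 ./hello.ibm
```

The output is one assignment per line, so it can be passed straight to env as shown in the comment.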

Bleah, it is very tedious looking for an appropriate option. I did find this one, which just confirms that MPI_Init() called ompi_mpi_abort() which is not news.

$ OMPI_MCA_mpi_abort_print_stack=1 flux wreckrun -n 1 ./hello.ibm
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[butte5:50259] [0] func:/opt/ibm/spectrum_mpi/lib/libopen-pal.so.3(opal_backtrace_buffer+0x3c) [0x10000081684c]
*** An error occurred in MPI_Init
[butte5:50259] [1] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(ompi_mpi_abort+0x36c) [0x10000014387c]
*** on a NULL communicator
[butte5:50259] [2] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(ompi_mpi_errors_are_fatal_comm_handler+0xdc) [0x10000012cc3c]
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[butte5:50259] [3] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(ompi_errhandler_invoke+0x1ec) [0x10000012bd1c]
***    and potentially your MPI job)
[butte5:50259] [4] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(MPI_Init+0xcc) [0x10000017307c]
[butte5:50260] [0] func:/opt/ibm/spectrum_mpi/lib/libopen-pal.so.3(opal_backtrace_buffer+0x3c) [0x10000081684c]
[butte5:50259] [5] func:./hello.ibm() [0x10000c34]
[butte5:50260] [1] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(ompi_mpi_abort+0x36c) [0x10000014387c]
[butte5:50259] [6] func:/lib64/libc.so.6(+0x24980) [0x100000284980]
[butte5:50260] [2] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(ompi_mpi_errors_are_fatal_comm_handler+0xdc) [0x10000012cc3c]
[butte5:50259] [7] func:/lib64/libc.so.6(__libc_start_main+0xc4) [0x100000284b74]
[butte5:50260] [3] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(ompi_errhandler_invoke+0x1ec) [0x10000012bd1c]
[butte5:50259] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[butte5:50260] [4] func:/opt/ibm/spectrum_mpi/lib/libmpi_ibm.so.3(MPI_Init+0xcc) [0x10000017307c]
[butte5:50260] [5] func:./hello.ibm() [0x10000c34]
[butte5:50260] [6] func:/lib64/libc.so.6(+0x24980) [0x100000284980]
[butte5:50260] [7] func:/lib64/libc.so.6(__libc_start_main+0xc4) [0x100000284b74]
[snip]

FWIW, environment set up by spectrum mpirun

$ /opt/ibm/spectrum_mpi/bin/mpirun -n 1 printenv
PMIX_INSTALL_PREFIX=/opt/ibm/spectrum_mpi
OMPI_MCA_mpi_leave_pinned=1
OMPI_MCA_memory=patcher
OMPI_MCA_mca_base_component_show_load_errors=0
OMPI_MCA_coll_tuned_priority=-1
OMPI_MCA_coll_hcoll_priority=-1
OMPI_MCA_coll_hcoll_enable=0
OMPI_MCA_schizo_ompi_prepend_ld_library_path=pami_port
OMPI_MCA_opal_signal=7,8,11
OMPI_MCA_pml=pami
OMPI_MCA_osc=pami
OMPI_MCA_btl=self
OMPI_LD_PRELOAD_POSTPEND_DISTRO=/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so
OMPI_MCA_mca_base_env_list_distro=MPI_ROOT,OPAL_PREFIX,OPAL_LIBDIR,PMIX_INSTALL_PREFIX,HWLOC_PLUGINS_PATH,SMPI_HCOLL_ENABLE_BCAST,SMPI_HCOLL_ENABLE_ALLTOALLV,SMPI_HCOLL_ENABLE_GATHER,SMPI_HCOLL_ENABLE_IALLTOALL,OMPI_LD_PRELOAD_POSTPEND_DISTRO,LD_LIBRARY_PATH
OMPI_MCA_pmix=^s1,s2,cray,isolated
PMIX_SERVER_TMPDIR=/var/tmp/garlick/ompi.butte5.5588/pid.67253/0/0
MPI_ROOT=/opt/ibm/spectrum_mpi
OPAL_PREFIX=/opt/ibm/spectrum_mpi
OPAL_LIBDIR=/opt/ibm/spectrum_mpi/lib
HWLOC_PLUGINS_PATH=/opt/ibm/spectrum_mpi/lib/hwloc
SMPI_HCOLL_ENABLE_BCAST=0
SMPI_HCOLL_ENABLE_ALLTOALLV=0
SMPI_HCOLL_ENABLE_GATHER=0
SMPI_HCOLL_ENABLE_IALLTOALL=0
LD_LIBRARY_PATH=/opt/ibm/spectrum_mpi/lib/pami_port:/opt/ibm/spectrum_mpi/lib:/opt/ibm/spectrum_mpi/lib
OMPI_COMMAND=printenv
OMPI_MCA_orte_precondition_transports=62f1069d83bf8724-32b799fcc2d7a7cb
LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so
[snip]
OMPI_MCA_orte_local_daemon_uri=986710016.0;tcp://192.168.64.2,134.9.50.105,134.9.6.11,192.168.128.5:48161;ud://659.12.1
OMPI_MCA_orte_hnp_uri=986710016.0;tcp://192.168.64.2,134.9.50.105,134.9.6.11,192.168.128.5:48161;ud://659.12.1
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=32
OMPI_MCA_orte_num_nodes=1
OMPI_MCA_shmem_RUNTIME_QUERY_hint=mmap
OMPI_MCA_orte_bound_at_launch=1
OMPI_MCA_ess=^singleton
OMPI_MCA_orte_ess_num_procs=1
OMPI_COMM_WORLD_SIZE=1
OMPI_COMM_WORLD_LOCAL_SIZE=1
OMPI_MCA_orte_tmpdir_base=/var/tmp/garlick
OMPI_MCA_orte_top_session_dir=/var/tmp/garlick/ompi.butte5.5588
OMPI_MCA_orte_jobfam_session_dir=/var/tmp/garlick/ompi.butte5.5588/pid.67253
OMPI_NUM_APP_CTX=1
OMPI_FIRST_RANKS=0
OMPI_APP_CTX_NUM_PROCS=1
OMPI_MCA_initial_wdir=/g/g0/garlick/flux-core/t/mpi
OMPI_MCA_orte_launch=1
PMIX_NAMESPACE=986710017
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.67253;tcp4://127.0.0.1:43879
PMIX_SECURITY_MODE=native,none
PMIX_PTL_MODULE=tcp,usock
PMIX_DSTORE_ESH_BASE_PATH=/var/tmp/garlick/ompi.butte5.5588/pid.67253/pmix_dstor_67253
OMPI_MCA_ess_base_jobid=986710017
OMPI_MCA_ess_base_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_MCA_orte_ess_node_rank=0
PMIX_ID=986710017.0
OMPI_FILE_LOCATION=/var/tmp/garlick/ompi.butte5.5588/pid.67253/0/0
OPAL_OUTPUT_STDERR_FD=36

Well, by randomly poking I found two critical environment variables that allow mpi to find itself:

MPI_ROOT=/opt/ibm/spectrum_mpi
OPAL_LIBDIR=/opt/ibm/spectrum_mpi/lib
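
Both values hang off the same install prefix, so they can be derived from it. A minimal sketch, assuming OPAL_LIBDIR is always <prefix>/lib as it is here; with_mpi_prefix is a hypothetical helper, not part of Flux or Spectrum MPI.

```shell
# Run a command with MPI_ROOT and OPAL_LIBDIR derived from one install prefix.
# Assumption: the library directory is <prefix>/lib, as in the listing above.
with_mpi_prefix() {
    prefix=$1
    shift
    env MPI_ROOT="$prefix" OPAL_LIBDIR="$prefix/lib" "$@"
}

# Possible usage:
#   with_mpi_prefix /opt/ibm/spectrum_mpi flux wreckrun -n 2 ./hello.ibm
```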

With those two set, a two-task MPI hello (on the same node) runs to completion under Flux, albeit spewing a bunch of warnings:

MPI_ROOT=/opt/ibm/spectrum_mpi \
OPAL_LIBDIR=/opt/ibm/spectrum_mpi/lib \
flux wreckrun -n 2 ./hello.ibm
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'butte5', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
--------------------------------------------------------------------------
Open MPI requires that every physically separate IB subnet that is
WARNING: There are more than one active ports on host 'butte5', but the
used between connected MPI processes must have different subnet ID
default subnet GID prefix was detected on more than one of these
values.
ports.  If these ports are connected to different physical IB

networks, this configuration will fail in Open MPI.  This version of
Please see this FAQ entry for more details:
Open MPI requires that every physically separate IB subnet that is

used between connected MPI processes must have different subnet ID
  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
values.


NOTE: You can turn off this warning by setting the MCA parameter
Please see this FAQ entry for more details:
      btl_openib_warn_default_gid_prefix to 0.

--------------------------------------------------------------------------
  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
--------------------------------------------------------------------------
[1523389210.396984] [butte5:128322:0]         mxm.c:196  MXM  WARN  The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240) 

A process has executed an operation involving a call to the
NOTE: You can turn off this warning by setting the MCA parameter
"fork()" system call to create a child process.  Open MPI is currently
      btl_openib_warn_default_gid_prefix to 0.
operating in a condition that could result in memory corruption or
[1523389210.396866] [butte5:128323:0]         mxm.c:196  MXM  WARN  The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240) 
--------------------------------------------------------------------------
other system errors; your job may hang, crash, or produce silent
--------------------------------------------------------------------------
data corruption.  The use of fork() (or system() or other calls that
A process has executed an operation involving a call to the
create child processes) is strongly discouraged.
"fork()" system call to create a child process.  Open MPI is currently

operating in a condition that could result in memory corruption or
The process that invoked fork was:
other system errors; your job may hang, crash, or produce silent

data corruption.  The use of fork() (or system() or other calls that
  Local host:          [[0,26],0] (PID 128322)
create child processes) is strongly discouraged.


If you are *absolutely sure* that your application will successfully
The process that invoked fork was:
and correctly survive a call to fork(), you may disable this warning

by setting the mpi_warn_on_fork MCA parameter to 0.
  Local host:          [[0,26],1] (PID 128323)
--------------------------------------------------------------------------

[butte5:128322] mca_base_component_repository_open: unable to open mca_pml_pami: libpami.so.3: cannot open shared object file: No such file or directory (ignored)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[butte5:128323] mca_base_component_repository_open: unable to open mca_pml_pami: libpami.so.3: cannot open shared object file: No such file or directory (ignored)
[butte5:128322] mca_base_component_repository_open: unable to open mca_coll_hcoll: libhcoll.so.1: cannot open shared object file: No such file or directory (ignored)
[butte5:128323] mca_base_component_repository_open: unable to open mca_coll_hcoll: libhcoll.so.1: cannot open shared object file: No such file or directory (ignored)
[butte5:128322] mca_base_component_repository_open: unable to open mca_coll_ibm: libcollectives.so.3: cannot open shared object file: No such file or directory (ignored)
[butte5:128322] mca_base_component_repository_open: unable to open mca_osc_pami: libpami.so.3: cannot open shared object file: No such file or directory (ignored)
[butte5:128323] mca_base_component_repository_open: unable to open mca_coll_ibm: libcollectives.so.3: cannot open shared object file: No such file or directory (ignored)
[butte5:128323] mca_base_component_repository_open: unable to open mca_osc_pami: libpami.so.3: cannot open shared object file: No such file or directory (ignored)
[1523389210.396984] [butte5:128322:0]         mxm.c:196  MXM  WARN  The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240) 
[1523389210.425558] [butte5:128322:0]         mxm.c:196  MXM  WARN  The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240) 
[1523389210.396866] [butte5:128323:0]         mxm.c:196  MXM  WARN  The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240) 
[1523389210.425558] [butte5:128323:0]         mxm.c:196  MXM  WARN  The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240) 
0: completed MPI_Init in 0.929s.  There are 2 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.434s

If run with OMPI_MCA_pmix_base_verbose=255, one can see that Flux's PMI is being successfully used to bootstrap.

So, a little progress.

That is definitely progress!

I admit the fork warning surprises/confuses me, though. How would we end up with that inside the actual spectrum MPI process?

So remind me again - have you been able to launch spectrum mpi applications with anything other than the spectrum mpirun? I'm a little worried the pami stuff is not self-contained in the application, and relies on something in orte.

Spectrum mpirun and jsrun, which is IBM's launcher to go with LSF on these things. I managed to get flux to successfully launch a multi-node MPI job with spectrum just now, but only by turning pami off. It looks like we'll have to enlist IBM to actually get a fix for this:

(sierrapysplash) splash:flux$ OMPI_MCA_osc=pt2pt OMPI_MCA_pml=yalla OMPI_MCA_btl=self MPI_ROOT=/opt/ibm/spectrum_mpi OPAL_LIBDIR=/opt/ibm/spectrum_mpi/lib flux wreckrun -N 4 env LD_LIBRARY_PATH=/opt/ibm/spectrum_mpi/lib/pami_port:/opt/ibm/spectrum_mpi/lib:/opt/ibm/spectrum_mpi/lib:/opt/mellanox/hcoll/lib OMPI_MCA_coll_hcoll_enable=0 bash -c 'ulimit -s 10240 ; env LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so ~/flux-base/mpitest-spectrum '
Hello world from processor sierra1414, rank 0 out of 4 processors
Hello world from processor sierra3369, rank 1 out of 4 processors
Hello world from processor sierra1415, rank 2 out of 4 processors
Hello world from processor sierra1416, rank 3 out of 4 processors

Great!

We need to involve Roy Mussleman to get this fixed by IBM ASAP. Do you want to come up to the 4th floor for a quick chat to strategize with Roy? I will give him a quick heads-up as well.

I would like to, but I’m in Santa Clara at the moment. Will you be
around tomorrow?

On 10 Apr 2018, at 14:16, Dong H. Ahn wrote:

Spectrum mpirun and jsrun, which is IBM's launcher to go with LSF on
these things. I managed to get flux to successfully launch a
multi-node MPI job with spectrum just now, but only by turning pami
off. It looks like we'll have to enlist IBM to actually get a fix for
this:

Great!

We need to involve Roy Mussleman to get this to be fixed by IBM ASAP.
Do you want to come up to 4th floor for quick to strategize with Roy?
I will give him a quick heads-up as well.

The potentially deeper worry here is that flux doesn't seem to work for OpenMPI builds that rely on these kinds of paths in the environment. I'm not sure if there's anything we can, or should, do about that from our end though. @rhc54 is there anything we can tie into that would make the flux-end handling for OpenMPI environment (prefix/libdir/mpi_root) requirements a little more robust?

I would like to, but I’m in Santa Clara at the moment. Will you be
around tomorrow?

I will be.

I just talked to Mussleman. He said the fastest route to get IBM's response would be to describe the problem in an email and send it to MPI/PAMI developers directly. And he has a couple of names. If there is a way to work around this in time, they are the ones who can provide the info or who can forward our inquiry.

@trws: can you send an email to Mussleman and copy me? His email is [email protected].

@garlick Sorry the squash caused confusion. There has been some argument in the OMPI world about having a lot of "in-between" commits. Schizo just checks for markers of a particular environment (flux, in your case) and sets things up to ensure the right components get selected (in your case, the flux PMI one).

@trws I'm not sure there is a great solution for the problem. OMPI by itself seems to be okay in that regard, but Spectrum does some nasty things with the environment - the timing of the "schizo" framework's development didn't dovetail into their initial efforts, and so mpirun is now a wrapper that fiddles with things before calling the real mpirun. This is what causes the fragility, as far as we've heard from folks.

Jim's flux work should be just fine - I confess we don't track/test it, but nothing has changed in those areas of the code. I can try to advise as you run into things, if that would help.

Thanks for that clarification @rhc54!

Poking around in /opt/ibm/spectrum_mpi it does appear that bin/mpirun is a wrapper for bin/stock/mpirun. Maybe we can learn something about how this works from the wrapper code. Good hint!
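
Short of reading the wrapper, one could also compare what it exports against the caller's environment. A rough sketch: env_added is a made-up helper, and it assumes one NAME=value per line with no newlines embedded in values. A variable whose value changed shows up as a new line too.

```shell
# Print lines that appear in the second file but not in the first.
# comm needs sorted input, so both files are sorted into temp copies first.
env_added() {
    before=$(mktemp)
    after=$(mktemp)
    sort "$1" > "$before"
    sort "$2" > "$after"
    comm -13 "$before" "$after"
    rm -f "$before" "$after"
}

# Possible usage:
#   env > base.txt
#   /opt/ibm/spectrum_mpi/bin/mpirun -n 1 printenv > smpi.txt
#   env_added base.txt smpi.txt
```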

@garlick: if you have specific questions, feel free to involve me and Mussleman. We have contact info for some Spectrum MPI developers.

Thanks @rhc54, it looks like I had a bad build of openmpi that was making me think we needed a more general fix. Should we warn people to build with anything to make sure they get the right prefix by default, or is that all handled in schizo?

Assuming IBM doesn't interfere, you can configure OMPI with --enable-orterun-prefix-by-default and that should ensure things are always set.

Ok. It may be worth putting a reference to that in our PMI docs, not
that it’s required to work with us necessarily, but we don’t have a
good way to work around those paths being missing.

On 11 Apr 2018, at 12:04, Ralph Castain wrote:

Assuming IBM doesn't interfere, you can configure OMPI with
--enable-orterun-prefix-by-default and that should ensure things
are always set.

OK, two things came out of the concall with IBM folks.

  1. It turned out the PAMI layer relies on PMIX. So flux (which supports only PMI) won't be able to bootstrap a Spectrum MPI job. IBM's recommendation was to support PMIX from within the flux instance.

  2. There are other environment variables that a Spectrum MPI job depends on. The easiest way to export all of them is to execute alias.pl, which is currently part of the JSM/Spectrum MPI installation. (The warning was that this can change in the future, so it is not fully future-proof.)

Related issue #1555.

In #1555 @rhc54 said:

Given the problems, you may find it simpler to just add PMIx support to flux. MPICH now supports PMIx, so I'm not sure what you gain by sticking with the older libraries, and it would allow you to smoothly move between JSM and flux.

It's interesting that this PAMI layer (library?) is, I guess, independently bootstrapping itself through PMIX, as opposed to being implemented as a plugin to OpenMPI where it would have access to OpenMPI's internal PMIish interfaces that work with multiple resource managers including Flux. Probably we're not going to be able to change that though.

I wonder if we can offer PMIX support in Flux by simply exporting a libpmix.so that implements the API, or if we'll have to implement PMIX's wire protocol, security, etc? I guess it depends on how libpami uses PMIX?

@garlick:

my current thinking is:

  1. Using the libpmix.so that is bundled with Spectrum MPI will allow us to bootstrap a flux instance with jsrun. I think one can play with writing a minimalistic PMI wrapper for feasibility + helping the current push of MLSI.

  2. Implementing our own PMIX from within flux will allow this flux instance to run spectrum MPI jobs.

We probably don't want to rely on the PMIX server running on the node (which was used to launch flux) in launching MPI jobs within the flux instance, though.

There are certainly PAMI bits implemented as OpenMPI plugins, and none of them are using PMIx_ symbols directly.

$ find /opt/ibm/spectrum_mpi  -name \*pami\*.so
/opt/ibm/spectrum_mpi/profilesupport/lib/libmca_common_pamiopal.so
/opt/ibm/spectrum_mpi/profilesupport/lib/libmca_common_pami.so
/opt/ibm/spectrum_mpi/profilesupport/lib/libpami_cudahook.so
/opt/ibm/spectrum_mpi/lib/spectrum_mpi/mca_osc_pami.so
/opt/ibm/spectrum_mpi/lib/spectrum_mpi/mca_pml_pami.so
/opt/ibm/spectrum_mpi/lib/libmca_common_pamiopal.so
/opt/ibm/spectrum_mpi/lib/pami_433/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_port_dt/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_port_ftdt/libpami.so
/opt/ibm/spectrum_mpi/lib/libmca_common_pami.so
/opt/ibm/spectrum_mpi/lib/pami_port_ft/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_port/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_noib/libpami.so
/opt/ibm/spectrum_mpi/lib/mpicoll/libpami.so
/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so
$ nm `find /opt/ibm/spectrum_mpi -name \*pami\*.so` |grep PMIx_
$

Not sure if that is the extent of the pami code though. Maybe another library is lurking somewhere else?

Sent an email directly to Josh Hursey @IBM and cc'ed you.

Josh can help you better than I can, given his direct knowledge of the PAMI code. My understanding is that PAMI pulls all the PMIx data out of the local JSM daemon that hosts the PMIx server library, but I don't know what interfaces they use to do it. They might dlopen it, which is why it wouldn't show in a dependency listing.
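
If the library is dlopen()ed, nm would not report it, but its name or symbol names would likely still sit in the binary as string constants. A crude check along those lines, assuming the text "pmix" appears literally in the file; find_pmix_strings is a hypothetical helper, not a Spectrum or PMIx tool.

```shell
# List files under a directory, matching a name pattern, whose raw bytes
# mention "pmix" in any case. grep -a treats binaries as text and -l prints
# only the names of matching files.
find_pmix_strings() {
    find "$1" -type f -name "$2" -exec grep -l -a -i 'pmix' {} +
}

# Possible usage:
#   find_pmix_strings /opt/ibm/spectrum_mpi '*pami*.so'
```

A hit only says the string is present; it does not show how, or whether, the library is actually loaded.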

One clarification just to ensure we are on the same page: there is no separate PMIx server running on the node. JSM's daemon acts as the PMIx server on each node (i.e., it calls PMIx server_init). However, I do agree that if you launch the flux instance, you would certainly want flux to handle the MPI wireup.

If there are concerns blocking your direct use of PMIx, we'd love to understand them and see if we can't resolve them. Ideally, we'd like to see flux hosting a PMIx server as there are increasingly more things being provided thru the PMIx library (e.g., comm cost matrix for scheduling, fabric topology, and storage directives).

This was one hangup that made integrating the "reference server" code difficult for us: pmix/pmix#102

If the wire protocol is now nailed down and documented, we could maybe implement our own server.

Dropping the "in progress" label since I am not actively working on this.

Is there anything we should add to 0.10.0 to make this easier? We do have these lua scripts that provide some environment settings needed by various MPI's, but they are all loaded unconditionally. Would it make sense to provide a way to conditionally set them, e.g. so you could launch --with-mpi=spectrum or similar?
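
To make the idea concrete, here is a shell sketch of conditional selection. It is not the Lua mechanism Flux uses, and --with-mpi is not an existing option; mpi_env is hypothetical, and the Spectrum settings are simply the two that worked earlier in this thread.

```shell
# Print environment settings for one named MPI personality.
# Unknown names print nothing and return non-zero, so nothing is set by default.
mpi_env() {
    case $1 in
        spectrum)
            prefix=${2:-/opt/ibm/spectrum_mpi}
            printf 'MPI_ROOT=%s\n' "$prefix"
            printf 'OPAL_LIBDIR=%s\n' "$prefix/lib"
            ;;
        *)
            return 1
            ;;
    esac
}

# Possible usage:
#   env $(mpi_env spectrum) flux wreckrun -n 2 ./hello.ibm
```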

Actually, that sounds extremely useful. I hadn't thought about it for a long time, but that's something I remember wishing for any number of times when working on the MPI end of the equation.


From: Jim Garlick notifications@github.com
Sent: Wednesday, July 11, 2018 7:07:16 AM
To: flux-framework/flux-core
Cc: Scogland, Tom; Mention
Subject: Re: [flux-framework/flux-core] flux PMI will not init spectrum MPI (#1382)

Dropping the "in progress" label since I am not actively working on this.

Is there anything we should add to 0.10.0 to make this easier? We do have these lua scripts that provide some environment settings needed by various MPI's, but they are all loaded unconditionally. Would it make sense to provide a way to conditionally set them, e.g. so you could launch --with-mpi=spectrum or similar?
