Singularity: Error running Singularity container with mpi

Created on 21 May 2018 · 38 Comments · Source: hpcng/singularity

Version of Singularity:

2.5.1-dist

Expected behavior

Successfully run the container on a cluster and produce results similar to running the binary directly.

Actual behavior

mpiexec.hydra -hostfile ./nodelist -n 4 -ppn 2 singularity run --app multinode namd.multi.img stmv 39
This is what happens when you run the namd multinode container...
running stmv workoad now
This is what happens when you run the namd multinode container...
running stmv workoad now
[cli_1]: write_line error; fd=16 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_1]: Unable to write to PMI_fd
[cli_1]: write_line error; fd=16 buf=:cmd=barrier_in
:
system msg for write_line failure : Bad file descriptor
[cli_1]: write_line error; fd=16 buf=:cmd=get_ranks2hosts
:
system msg for write_line failure : Bad file descriptor
[cli_1]: expecting cmd="put_ranks2hosts", got cmd=""
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
[cli_1]: write_line error; fd=16 buf=:cmd=abort exitcode=68204815
:
system msg for write_line failure : Bad file descriptor
all done!
[cli_0]: write_line error; fd=14 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=14 buf=:cmd=barrier_in
:
system msg for write_line failure : Bad file descriptor
[cli_0]: write_line error; fd=14 buf=:cmd=get_ranks2hosts
:
system msg for write_line failure : Bad file descriptor
[cli_0]: expecting cmd="put_ranks2hosts", got cmd=""
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
[cli_0]: write_line error; fd=14 buf=:cmd=abort exitcode=68204815
:
system msg for write_line failure : Bad file descriptor
all done!
This is what happens when you run the namd multinode container...
This is what happens when you run the namd multinode container...

Steps to reproduce behavior

Config file:
https://github.com/intel/Intel-HPC-Container/blob/master/containers/namd/Singularity.namd

$ singularity pull --name namd.multi.img shub://intel/Intel-HPC-Container:multi.namd

$ mpiexec.hydra -hostfile ./nodelist -n 4 -ppn 2 singularity run --app multinode namd.multi.img stmv 39

Source

Smahane

All 38 comments

Hi @Smahane ... Were you able to work around this issue?

jmstover on 22 May 2018

After further testing, it appears to be an environment issue, not a Singularity issue.

Thanks

Smahane on 22 May 2018

@jmstover Unfortunately I haven't figured out how to work around this yet, or what is causing the problem! But when I ran the container on a small cloud cluster, it ran as expected.

Smahane on 22 May 2018

The only thing I can see from the command line that looks like it _could_ possibly be an issue is that the command is running an image from shub ... then from that running mpiexec using the nodes in ./nodelist and calling singularity on those nodes. (That or... it didn't parse out right ;) )

I'll see if I can replicate it at all in a VM setup.

jmstover on 22 May 2018

Oh! That was just a formatting issue. There should be 2 steps: pull, then run. I would very much appreciate it if you could test it out.

Things used to work, and the only thing I changed in my local cluster was updating Singularity to 2.5.1. That is why I thought it was a defect. But when I tested it on a cloud cluster (with Singularity updated as well), it worked.

I don't know what is different in my local cluster that is causing this error!
Very frustrating!

Smahane on 22 May 2018

I tested this on 3 different clusters with an older Singularity version and had no issue. I then updated Singularity to the 2.5.1-dist version and got the same error as above on all 3 clusters.
So I can confirm that this is a reproducible defect when running with MPI.

I would appreciate a fix.

Smahane on 24 May 2018

Just wondering if we should wait for a fix for this issue or downgrade to the previous version of Singularity.

Smahane on 30 May 2018

Hi @Smahane,

I haven't yet been able to reproduce this. :/

It looks like a socket is being closed, but that shouldn't be happening currently (@cclerget ?). Have you tried building from the current release-2.5 branch?

-J

jmstover on 30 May 2018
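
One way to check the "a socket is being closed" theory is to list which descriptors of a process are sockets, using /proc. The following is a minimal sketch, assuming a Linux /proc filesystem; `list_socket_fds` is a hypothetical helper name, not part of Singularity or any MPI distribution.

```sh
# Print the descriptor numbers of process $1 (default: this shell) that
# refer to sockets, one per line. Prints nothing if there are none.
list_socket_fds() {
    pid=${1:-$$}
    for entry in /proc/"$pid"/fd/*; do
        # Each entry is a symlink; sockets show up as "socket:[inode]".
        target=$(readlink "$entry" 2>/dev/null) || continue
        case $target in
            socket:*) echo "${entry##*/}" ;;
        esac
    done
}
```

Comparing the list for a rank started directly by the launcher with the list for the same rank started through singularity would show whether a socket went missing along the way.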

@jmstover I haven't tried the release-2.5 branch. I'm not sure whether it will have the same issue too. I will just have to go back to the 2.4 release for our local cluster, as that one was working with MPI.

We use a data center (which we don't own), and we need to submit a request to them to change the Singularity version. I'm not sure which version to request now.

Smahane on 30 May 2018

There were some recent changes in the release-2.5 branch on how socket closing was done on startup. See this commit:

https://github.com/singularityware/singularity/commit/b0898b6952f34bb3f4b6440bb5821a30b13482ed

If it is the socket closing, the changes in close_fd() on that commit may be what you're looking for.

Building from the release-2.5 branch if you can would be your best option.

jmstover on 30 May 2018

@jmstover was this change tested at all?

Smahane on 30 May 2018

It's in the current release-2.5 branch but not in the 2.5.1 release. We've done normal testing... and I've done a basic mpi with it.

But I can say I have not done much beyond a simple run with hostname.

jmstover on 30 May 2018

Release 2.5.0 has the same issue. Release 2.4.6 doesn't.

Smahane on 31 May 2018

Hello

I hit the same issue using Singularity 2.5.x, with Intel MPI (2016, 2018) as well as OpenMPI (2.0.2). The issue doesn't exist on 2.4.x versions. I don't know how to work around this issue (env? something related to PMI?). Please let me know if more input is needed.

regards
Christophe

bessonc on 2 Jun 2018

Hello,
I have found empirically that the MPI error messages disappear if, in singularity-2.5.1/src/util/util.c,
line 478 close(fd);
is replaced by
line 478 continue;

elasto on 5 Jun 2018

@elasto ... Hrmm, I haven't replicated the "same" issue, but I've run into an MPI hang. I commented out the calls to fd_cleanup() and it still hung for me. I haven't yet validated on my end whether it could be an MPI library interaction issue. :/

jmstover on 5 Jun 2018

Okay... I finally got it running after some environment hacking.

I can get a simple MPI job to run across multiple nodes on 2.5.1 proper (built from the 2.5.1 tag). My MPI job just prints out the host and ranking... i.e.

Hello from proc compute-0-9.local, rank 0 of 16
Hello from proc compute-0-10.local, rank 14 of 16

The "hang" I was getting was from incompatible MPI libraries between the host and image.

jmstover on 5 Jun 2018

@jmstover what environment hacks did you do?
Can you please clarify who runs the jobs? the host mpi or the image mpi?

Thanks,

Smahane on 5 Jun 2018

Here's my final command line:

SINGULARITYENV_LD_LIBRARY_PATH=/opt/openmpi/lib \
SINGULARITYENV_PREPEND_PATH=/opt/openmpi/bin \
  mpiexec -n 16 \
  /share/apps/singularity-2.5.1/bin/singularity exec \
  -B /opt/openmpi \
  centos7.img ~/mpitest

I needed to bind mount the host's OpenMPI (/opt/openmpi) into the container in my case. The cluster has a version that's not from the stock repository, and it looked as if I was hitting an ABI issue between the version in the container and the version on the host.

> the host mpi or the image mpi

I ran it both ways: starting mpiexec on the host, and starting mpiexec from within the container. You will want to start it from the host if you are doing multinode, but if you're doing a single-node MPI job, you should be able to start it from within the container without issue (be sure to have an SSH client in the container).

I also created a simple job script and submitted a job with SLURM and ran across two nodes. Since it's multinode, mpiexec ran Singularity.

jmstover on 5 Jun 2018
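
A host/image MPI mismatch like the one described above can be narrowed down by checking which MPI shared library the test binary actually resolves to. This is a minimal sketch, assuming `ldd` is available and that the MPI library name contains "libmpi"; `mpi_libs_of` is a hypothetical helper, and the commented container invocation only reuses the image, bind mount and binary from the command line above.

```sh
# Print the MPI shared libraries that binary $1 resolves to, one per line.
# Exit status is non-zero when no MPI library appears in the ldd output.
mpi_libs_of() {
    ldd "$1" 2>/dev/null | grep -i 'libmpi'
}

# On the host:
#   mpi_libs_of ~/mpitest
# Inside the container (same image and bind mount as in the thread):
#   singularity exec -B /opt/openmpi centos7.img sh -c 'ldd ~/mpitest | grep -i libmpi'
```

If the two runs print different paths or versions, the ranks started by the host's mpiexec are probably loading a different MPI than the launcher expects.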

And here's job output ....

$ salloc -n 16
salloc: Granted job allocation 548
$ SINGULARITYENV_LD_LIBRARY_PATH=/opt/openmpi/lib SINGULARITYENV_PREPEND_PATH=/opt/openmpi/bin mpiexec -n 16 /share/apps/singularity-2.5.1/bin/singularity exec -B /opt/openmpi centos7.img ~/mpitest
Hello from proc compute-0-9.local, rank 0 of 16
Hello from proc compute-0-10.local, rank 9 of 16
Hello from proc compute-0-9.local, rank 4 of 16
Hello from proc compute-0-10.local, rank 11 of 16
Hello from proc compute-0-9.local, rank 1 of 16
Hello from proc compute-0-10.local, rank 14 of 16
Hello from proc compute-0-9.local, rank 5 of 16
Hello from proc compute-0-10.local, rank 10 of 16
Hello from proc compute-0-9.local, rank 7 of 16
Hello from proc compute-0-10.local, rank 12 of 16
Hello from proc compute-0-9.local, rank 2 of 16
Hello from proc compute-0-10.local, rank 13 of 16
Hello from proc compute-0-9.local, rank 3 of 16
Hello from proc compute-0-10.local, rank 8 of 16
Hello from proc compute-0-9.local, rank 6 of 16
Hello from proc compute-0-10.local, rank 15 of 16

jmstover on 5 Jun 2018

Hi, I have the same problem with Intel MPI. I have attached partial output of "strace -f", which appears to show that Intel's "pmi_proxy" creates a socketpair (line numbered 3051), but the fd is closed in the child process by Singularity's "action-suid" (line 4798). Hope this can help someone. Please tell me if you need more context from the strace output. I haven't tried using OpenMPI; the application and container I am working with need to use Intel MPI.

problem-2.5.1.txt

hakonenger on 6 Jun 2018
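
The failure mode the strace points to can be reproduced in miniature without MPI or Singularity. The sketch below is only an illustration under stated assumptions: fd 9 and a temporary file stand in for the launcher's socketpair, and a subshell that closes the descriptor stands in for the intermediate process. `demo_inherited_fd` is a hypothetical name.

```sh
# A parent opens a descriptor and two children try to write to it.
# The first child inherits it untouched; the second has it closed first,
# so its write fails with "Bad file descriptor" (EBADF).
demo_inherited_fd() {
    tmp=$(mktemp) || return 1
    # Stand-in for the launcher: open fd 9 (arbitrary number) for writing.
    exec 9>"$tmp"

    # Child that inherits fd 9 unchanged: the write succeeds.
    ( echo "cmd=init" >&9 ) 2>/dev/null && echo "inherited: write ok"

    # Child whose descriptor was closed before it runs: the write fails.
    ( exec 9>&-; echo "cmd=init" >&9 ) 2>/dev/null || echo "closed: write failed"

    # Clean up the parent's descriptor and the temp file.
    exec 9>&-
    rm -f "$tmp"
}
```

Calling `demo_inherited_fd` prints "inherited: write ok" followed by "closed: write failed", which mirrors the "write_line error ... Bad file descriptor" lines in the original report.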

Hi,
As pointed out by elasto, things seem to work for our images (for IMPI as well as OMPI) before this recent commit:
https://github.com/singularityware/singularity/commit/e80c9dd61932cfa1627400ccd1b5cb316556a1af

With version 2.5.1, I retested on 2 nodes sharing a hello image:

KO with IMPI (using mpiexec.hydra)
KO for both IMPI and OMPI through srun
OK for OMPI via mpirun or salloc+mpirun (like you)

bessonc on 6 Jun 2018

@bessonc The release-2.5 branch _should_ fix this then... that code was changed, and sockets should only be closed if you run in a network namespace.

See commit https://github.com/singularityware/singularity/commit/caca8628f6479b35c7aa409c228dbfabb0a8a75d

jmstover on 6 Jun 2018

@jmstover Yes, it seems to work for me with Intel MPI with the release-2.5 branch. Thanks!

hakonenger on 7 Jun 2018

@hakonenger Good to hear that worked for you.

Anyone else who's having the issue still getting it when running from the release-2.5 branch?

jmstover on 7 Jun 2018

Hi folks,
I tested release-2.5 with a couple of MPI benchmarks, compiled and run with Intel MPI and OpenMPI. Both work fine.

elasto on 9 Jun 2018

I will test this fix first thing tomorrow and update everyone.
Thank you

Smahane on 12 Jun 2018

Hi folks,

I have similar issues.
For Intel MPI, starting mpirun within the container works, while starting mpirun on the host shows this error.
For OpenMPI, both work fine.

I tested with 2.5.0 and 2.5.1.

Only going back to 2.4.6 solves this problem.

Would a new release fix this problem?
Thanks

xuzheng97 on 3 Jul 2018

Hi @xuzheng97,

There should be a new release (2.5.2) within the next few days that contains the fixes from the release-2.5 branch that worked for others to get around the issue.

jmstover on 3 Jul 2018

Hi, @jmstover

Thanks

I have another question. Could you advise?
How can I list running containers on the host?
And if I run 'mpirun -np 8 singularity exec xxx.exe', would 8 singularity containers run on the host, or only 1 singularity container with 8 processes?

Thanks again

xuzheng97 on 3 Jul 2018

It would be 8 singularity containers. If you wanted 1 container, with 8 processes then you would do something like:

singularity exec some.img mpirun -n 8 xxx.exe

So... you run mpirun from within the container, instead of using mpirun to execute singularity.

There isn't a good way to list running containers, as Singularity itself is out of the equation once it spawns the process. For example, this is what shows up as a singularity shell process:

jason    25531  0.0  0.0  23268  5492 pts/8    Ss   May25   0:00  \_ -bash
jason      744  0.3  0.0  19408  4412 pts/8    S+   02:43   0:00  |   \_ /bin/bash --norc

The singularity shell is the /bin/bash --norc process (pid 744), which is seen as just a child of my terminal bash shell. If you run singularity exec ... the process you run is seen as a child of your terminal. Singularity just exec's the process after setting everything up.

$ singularity exec ./ubuntu-latest.img ps auxwwf | grep -B1 'ps a' | grep -v grep
jason    25531  0.0  0.0  23268  5492 pts/8    Ss   May25   0:00  \_ -bash
jason      990  0.0  0.0  34712  3268 pts/8    R+   02:52   0:00  |   \_ ps auxwwf

Pid 990 is the ps command that's being run with singularity exec ... and it's seen just as a child of my terminal bash process.

jmstover on 3 Jul 2018
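
The "Singularity just exec's the process" behaviour can be illustrated with plain shell, with no container involved. This is a sketch of exec semantics only, not of Singularity internals; `show_exec_pid` is a hypothetical name and the outer `sh -c` merely plays the role of the wrapper.

```sh
# The wrapper prints its PID and then exec's the payload, which prints its
# own PID. exec replaces the process image in place, so both lines carry
# the same number and no separate wrapper process is left behind.
show_exec_pid() {
    sh -c 'echo "wrapper pid: $$"; exec sh -c "echo \"payload pid: \$\$\""'
}
```

Because the wrapper is replaced rather than kept as a parent, a process listing only ever shows the payload, which is why there is nothing named after the wrapper to look for.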

@jmstover
Thanks a lot for your explanation.

If I run mpirun within the container, it seems very difficult to run multinode. I googled and somebody suggested using an SSH wrapper...

Thanks again

xuzheng97 on 3 Jul 2018

👍2

Yes, if you start it from within the container, doing multinode is going to be painful. When MPI SSH's to another node to start the process, it's not within a container on the remote side... and that's where the wrapper comes in. It needs to start up a container with the right image and drop the SSH connection into it.

But, IMO, doing it that way works better if you have an MPI job that runs on a _single_ node. Having mpiexec call singularity is better for multinode.

jmstover on 3 Jul 2018
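
To make the wrapper idea concrete, here is a dry-run sketch of the command such a wrapper would end up running: the remote command is prefixed with `singularity exec <image>` so that it starts inside a container on the other node. Everything here is an assumption for illustration: `containerized_remote_cmd` is a hypothetical helper, it only prints the command instead of running it, and a real wrapper would also have to parse whatever SSH options the MPI launcher passes.

```sh
# Print the SSH command that would start "$@" inside image $2 on host $1.
# Dry run only: nothing is executed on any remote host.
containerized_remote_cmd() {
    host=$1
    image=$2
    shift 2
    printf 'ssh %s singularity exec %s %s\n' "$host" "$image" "$*"
}

# Example (hypothetical host and remote command, image name from the thread):
#   containerized_remote_cmd node2 some.img hostname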

@jmstover: this explanation is very interesting. If "mpirun -nX singularity exec ..." spawns multiple singularity processes, does it mean MPI cannot use shm for intra-node communication between those processes? (Which would mean pmi_proxy does this job?)

bessonc on 3 Jul 2018

Hi @bessonc,

No, /dev/shm can be used (depending on options). By default the host /dev is bind mounted, so that's available to you.

The reason I prefer running MPI from within the container, rather than having mpiexec execute singularity, on a single node is that there is no reason for multiple access hits to an image file, or for taking up multiple loop devices, on a single-node run. So don't run multiple instances of singularity; run the MPI processes from within the container. The behavior should be the same either way.

jmstover on 3 Jul 2018

👍1

Hi @jmstover

Thank you for your reply. I am not referring to /dev/shm but to the shared memory used for inter-process communication on the same node (http://man7.org/linux/man-pages/man7/shm_overview.7.html), which MPI uses as its underlying transport (I_MPI_FABRICS=shm is the default for intra-node).

In my use case, I run Singularity over multiple nodes, with multiple processes per node (and multiple OMP threads, depending on the CPU arch), so I need "mpirun singularity exec ..." to do this. If multiple containers are spawned on a given host, I believe there will be one user namespace per process, so shm_*() calls won't work between those processes (??). I could not identify any process related to Singularity itself, but I tried 2 things on a single node: mpirun and salloc+mpirun. In both cases, both processes seem to be inherited from the same PPID (mpiexec.hydra or slurmstepd). Many thanks in advance for any enlightenment :)

mpirun + singularity exec

$ mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10

PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
1441 26907 26907 26907 ? -1 Ss 0 0:00 _ sshd: bessonc [priv]
26907 26909 26907 26907 ? -1 S 50133 0:00 | _ sshd: bessonc@pts/3
26909 26910 26910 26910 pts/3 27571 Ss 50133 0:00 | _ -bash
26910 27571 27571 26910 pts/3 27571 S+ 50133 0:00 | _ /bin/sh /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27571 27576 27571 26910 pts/3 27571 S+ 50133 0:00 | _ mpiexec.hydra -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27576 27577 27577 26910 pts/3 27571 S 50133 0:00 | _ /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:42238 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 2005308951 --usize -2 --proxy-id 0
27577 27581 27581 26910 pts/3 27571 Rl 50133 2:41 | _ xhpcg_avx --n 192 --t 10
27577 27582 27582 26910 pts/3 27571 Rl 50133 2:41 | _ xhpcg_avx --n 192 --t 10

salloc + mpirun + singularity exec

$ salloc --reservation=sing --exclusive -N1 -n2 -w mo73 mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10

PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
25345 25346 25346 25346 pts/0 25346 Ss+ 16190 0:00 | _ -bash
1441 26907 26907 26907 ? -1 Ss 0 0:00 _ sshd: bessonc [priv]
26907 26909 26907 26907 ? -1 S 50133 0:00 | _ sshd: bessonc@pts/3
26909 26910 26910 26910 pts/3 27629 Ss 50133 0:00 | _ -bash
26910 27627 27627 26910 pts/3 27629 Sl 50133 0:00 | _ salloc --reservation=sing --exclusive -N1 -n2 -w mo73 mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27627 27629 27629 26910 pts/3 27629 S+ 50133 0:00 | _ /bin/sh /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27629 27634 27629 26910 pts/3 27629 S+ 50133 0:00 | _ mpiexec.hydra -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27634 27635 27635 26910 pts/3 27629 Sl 50133 0:00 | _ /usr/bin/srun --nodelist mo73 -N 1 -n 1 --input none /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:45375 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 618750365 --usize -2 --proxy-id -1
27635 27636 27635 26910 pts/3 27629 S 50133 0:00 | _ /usr/bin/srun --nodelist mo73 -N 1 -n 1 --input none /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:45375 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 618750365 --usize -2 --proxy-id -1
1 27646 27645 27645 ? -1 Sl 0 0:00 slurmstepd: [1066.0]
27646 27652 27652 27645 ? -1 S 50133 0:00 _ /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:45375 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 618750365 --usize -2 --proxy-id -1
27652 27659 27659 27645 ? -1 Rl 50133 4:50 _ xhpcg_avx --n 192 --t 10
27652 27660 27660 27645 ? -1 Rl 50133 4:51 _ xhpcg_avx --n 192 --t 10

bessonc on 3 Jul 2018

> I believe there will be one user namespace / process

By default we are not spawning into a new user or PID namespace in Singularity 2.x. We are thinking of changing this behavior in 3.x though.

Currently you need to use --contain/--containall or the -p option if you want a PID namespace, and the -u option for a user namespace.

jmstover on 3 Jul 2018
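
Whether two ranks on the same node really share a namespace can be checked from /proc rather than guessed. This is a minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/ns and that the caller is allowed to read those links for both processes; `same_namespace` is a hypothetical helper name.

```sh
# Succeed (status 0) if PIDs $2 and $3 are in the same namespace of type $1
# (for example pid, user, ipc, mnt or net); status 1 if they differ,
# status 2 if either namespace link cannot be read.
same_namespace() {
    ns_type=$1
    pid_a=$2
    pid_b=$3
    a=$(readlink "/proc/$pid_a/ns/$ns_type" 2>/dev/null) || return 2
    b=$(readlink "/proc/$pid_b/ns/$ns_type" 2>/dev/null) || return 2
    [ "$a" = "$b" ]
}

# Example with the two xhpcg_avx PIDs from the process tree above:
#   same_namespace pid 27581 27582 && echo "same PID namespace"
```

With the default 2.x behaviour described above, the check would be expected to succeed for both ranks; with -p it would be expected to fail for the pid type.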

OK, thank you again. When I use -p (for a PID namespace), the process tree shows 2 HPCG processes spawned by 2 different Singularity action-suid binaries, and MPI complains...

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(201)...................: MPI_Send(buf=0x7fdbf40c2010, count=36864, MPI_INT, dest=0, tag=98, MPI_COMM_WORLD) failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(302): fail failed
dcp_recv(165)...................: Internal MPI error!  Cannot read from remote process
 Two workarounds have been identified for this issue:
 1) Enable ptrace for non-root users with:
    echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
 2) Or, use:
    I_MPI_SHM_LMT=shm

However, setting the Intel MPI environment variable suggested at the end of the error message (I_MPI_SHM_LMT=shm) works, and I didn't notice any performance collapse at this step (!)

    1 29274 29273 29273 ?           -1 Sl       0   0:00 slurmstepd: [1078.0]
29274 29280 29280 29273 ?           -1 S    50133   0:00  \_ /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:40198 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1179907775 --usize -2 --proxy-id -1
29280 29287 29287 29273 ?           -1 S    50133   0:00      \_ /usr/libexec/singularity/bin/action-suid xhpcg_avx --n 192 --t 10
29287 29293 29287 29273 ?           -1 Rl   50133   9:19      |   \_ xhpcg_avx --n 192 --t 10
29280 29288 29288 29273 ?           -1 S    50133   0:00      \_ /usr/libexec/singularity/bin/action-suid xhpcg_avx --n 192 --t 10
29288 29291 29288 29273 ?           -1 Rl   50133   9:18          \_ xhpcg_avx --n 192 --t 10

bessonc on 3 Jul 2018
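
The first workaround in the error message depends on the Yama ptrace_scope setting, which can be inspected before deciding between the two options. The sketch below assumes a Linux kernel with the Yama security module; `describe_ptrace_scope` is a hypothetical helper, and the descriptions paraphrase the kernel's documented meanings for values 0 to 3.

```sh
# Translate a Yama ptrace_scope value into a short description.
describe_ptrace_scope() {
    case $1 in
        0) echo "0: classic, same-uid processes may attach to each other" ;;
        1) echo "1: restricted, only descendants or declared tracers may be attached to" ;;
        2) echo "2: admin-only, attaching requires CAP_SYS_PTRACE" ;;
        3) echo "3: no attach, ptrace attach is disabled" ;;
        *) echo "unknown value: $1"; return 1 ;;
    esac
}

# Report the current setting if the kernel exposes it.
scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    describe_ptrace_scope "$(cat "$scope_file")"
else
    echo "Yama ptrace_scope is not available on this kernel"
fi
```

On a cluster where the setting cannot be changed, the second workaround is the one that remains: exporting I_MPI_SHM_LMT=shm, or passing it with -genv in the same way OMP_NUM_THREADS is passed in the commands above.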
