Singularity version: 2.5.1-dist
Expected behavior: successfully run the container on a cluster and produce results similar to running the binary.
Actual behavior:
mpiexec.hydra -hostfile ./nodelist -n 4 -ppn 2 singularity run --app multinode namd.multi.img stmv 39
This is what happens when you run the namd multinode container...
running stmv workoad now
This is what happens when you run the namd multinode container...
running stmv workoad now
[cli_1]: write_line error; fd=16 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_1]: Unable to write to PMI_fd
[cli_1]: write_line error; fd=16 buf=:cmd=barrier_in
:
system msg for write_line failure : Bad file descriptor
[cli_1]: write_line error; fd=16 buf=:cmd=get_ranks2hosts
:
system msg for write_line failure : Bad file descriptor
[cli_1]: expecting cmd="put_ranks2hosts", got cmd=""
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
[cli_1]: write_line error; fd=16 buf=:cmd=abort exitcode=68204815
:
system msg for write_line failure : Bad file descriptor
all done!
[cli_0]: write_line error; fd=14 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=14 buf=:cmd=barrier_in
:
system msg for write_line failure : Bad file descriptor
[cli_0]: write_line error; fd=14 buf=:cmd=get_ranks2hosts
:
system msg for write_line failure : Bad file descriptor
[cli_0]: expecting cmd="put_ranks2hosts", got cmd=""
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
[cli_0]: write_line error; fd=14 buf=:cmd=abort exitcode=68204815
:
system msg for write_line failure : Bad file descriptor
all done!
This is what happens when you run the namd multinode container...
This is what happens when you run the namd multinode container...
Steps to reproduce:
Config file:
https://github.com/intel/Intel-HPC-Container/blob/master/containers/namd/Singularity.namd
$ singularity pull --name namd.multi.img shub://intel/Intel-HPC-Container:multi.namd
$ mpiexec.hydra -hostfile ./nodelist -n 4 -ppn 2 singularity run --app multinode namd.multi.img stmv 39
Hi @Smahane ... Were you able to work around this issue?
After further testing, it appears to be an environment issue, not a Singularity issue.
Thanks
@jmstover unfortunately I haven't figured out how to work around this yet or what is causing the problem! But when I ran the container on a small cloud cluster, it ran as expected.
The only thing I can see from the command line that looks like it _could_ possibly be an issue is that the command is running an image from shub ... then from that running mpiexec using the nodes in ./nodelist and calling singularity on those nodes. (That or... it didn't parse out right ;) )
I'll see if I can replicate it at all in a VM setup.
Oh! That was just a formatting issue. There should be 2 steps: pull, then run. I would very much appreciate it if you could test it out.
Things used to work, and the only thing I changed in my local cluster was updating Singularity to 2.5.1. That is why I thought it was a defect. But when I tested it on a cloud cluster (with Singularity updated as well), it worked.
I don't know what is different in my local cluster that is causing this error!
Very frustrating!
I tested this issue on 3 different clusters with an older Singularity version and had no issue. Then I updated Singularity to the 2.5.1-dist version and got the same error as above on all 3 clusters.
So I confirm that this is a reproducible defect when running with MPI.
I would appreciate a fix.
Just wondering if we should wait for a fix for this issue or downgrade to the previous version of Singularity.
Hi @Smahane,
I haven't yet been able to reproduce this. :/
It looks like a socket is being closed, but that shouldn't be happening currently (@cclerget ?). Have you tried building from the current release-2.5 branch?
-J
@jmstover I haven't tried the release-2.5 branch. I'm not sure if it will have the same issue too. I will just have to go back to the 2.4 release for our local cluster, as that one was working with MPI.
We have a data center (we don't own) and we need to submit a request to them to change the Singularity version. I'm not sure which version to request now.
There were some recent changes in the release-2.5 branch on how socket closing was done on startup. See this commit:
https://github.com/singularityware/singularity/commit/b0898b6952f34bb3f4b6440bb5821a30b13482ed
If it is the socket closing, the changes in close_fd() on that commit may be what you're looking for.
Building from the release-2.5 branch if you can would be your best option.
@jmstover was this change tested at all?
It's in the current release-2.5 branch but not in the 2.5.1 release. We've done normal testing... and I've done a basic mpi with it.
But I can say I have not done much beyond a simple run with hostname.
Release 2.5.0 has the same issue. Release 2.4.6 doesn't.
Hello
I got the same issue using Singularity 2.5.x, with Intel MPI (2016, 2018) as well as OpenMPI (2.0.2). The issue doesn't exist on 2.4.x versions. I don't know how to work around this issue (env? something related to PMI?). Please let me know if more inputs are needed.
regards
Christophe
Hello,
I have found empirically that in singularity-2.5.1/src/util/util.c, replacing
line 478 close(fd);
with
line 478 continue;
makes the MPI error message disappear.
@elasto ... Hrmm, I haven't replicated the "same" issue. But I've run into an MPI hang. I commented out the calls to fd_cleanup() and it still hung for me. But I haven't validated on my end whether it could be an MPI library interaction issue yet. :/
Okay... I finally got it running after some environment hacking.
I can get a simple MPI job to run across multiple nodes on 2.5.1 proper (built from the 2.5.1 tag). My MPI job just prints out the host and ranking... i.e.
Hello from proc compute-0-9.local, rank 0 of 16
Hello from proc compute-0-10.local, rank 14 of 16
The "hang" I was getting was from incompatible MPI libraries between the host and image.
@jmstover what environment hacks did you do?
Can you please clarify who runs the jobs? the host mpi or the image mpi?
Thanks,
Here's my final command line:
SINGULARITYENV_LD_LIBRARY_PATH=/opt/openmpi/lib \
SINGULARITYENV_PREPEND_PATH=/opt/openmpi/bin \
mpiexec -n 16 \
/share/apps/singularity-2.5.1/bin/singularity exec \
-B /opt/openmpi \
centos7.img ~/mpitest
I needed to bind mount the host's OpenMPI (/opt/openmpi) into the container in my case. The cluster has a version that's not from the stock repository, and it looked as if I was hitting an ABI issue between the version in the container and the version on the host.
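For anyone adapting the command line above to a different MPI prefix, here is a minimal sketch that only prints the command instead of running it. The function name host_mpi_cmd and its argument order are made up for illustration; the environment variables and the -B option are the ones used in the command above.

```shell
#!/usr/bin/env bash
# host_mpi_cmd PREFIX NPROCS IMAGE PROGRAM [ARGS...]
# Hypothetical helper: prints (does not run) an mpiexec command that bind
# mounts the host MPI installation at PREFIX into the container.
host_mpi_cmd() {
  local prefix=$1 nprocs=$2 image=$3
  shift 3
  printf 'SINGULARITYENV_LD_LIBRARY_PATH=%s/lib ' "$prefix"
  printf 'SINGULARITYENV_PREPEND_PATH=%s/bin ' "$prefix"
  printf 'mpiexec -n %s singularity exec -B %s %s' "$nprocs" "$prefix" "$image"
  printf ' %s' "$@"      # program and its arguments, one per word
  printf '\n'
}

# Dry run with the values from this thread.
host_mpi_cmd /opt/openmpi 16 centos7.img ./mpitest
```

Printing first makes it easy to check the paths before submitting a job; the assumption is that the host MPI keeps its libraries in PREFIX/lib and its binaries in PREFIX/bin, as /opt/openmpi does here.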
the host mpi or the image mpi
I ran it both ways... starting mpiexec on the host, and also starting mpiexec from within the container. You will want to start it from the host if you are doing multinode, but if you're doing a single node MPI job, you should be able to start it from within the container without issue (be sure to have an SSH client in the container).
I also created a simple job script and submitted a job with SLURM and ran across two nodes. Since it's multinode, mpiexec ran Singularity.
And here's job output ....
$ salloc -n 16
salloc: Granted job allocation 548
$ SINGULARITYENV_LD_LIBRARY_PATH=/opt/openmpi/lib SINGULARITYENV_PREPEND_PATH=/opt/openmpi/bin mpiexec -n 16 /share/apps/singularity-2.5.1/bin/singularity exec -B /opt/openmpi centos7.img ~/mpitest
Hello from proc compute-0-9.local, rank 0 of 16
Hello from proc compute-0-10.local, rank 9 of 16
Hello from proc compute-0-9.local, rank 4 of 16
Hello from proc compute-0-10.local, rank 11 of 16
Hello from proc compute-0-9.local, rank 1 of 16
Hello from proc compute-0-10.local, rank 14 of 16
Hello from proc compute-0-9.local, rank 5 of 16
Hello from proc compute-0-10.local, rank 10 of 16
Hello from proc compute-0-9.local, rank 7 of 16
Hello from proc compute-0-10.local, rank 12 of 16
Hello from proc compute-0-9.local, rank 2 of 16
Hello from proc compute-0-10.local, rank 13 of 16
Hello from proc compute-0-9.local, rank 3 of 16
Hello from proc compute-0-10.local, rank 8 of 16
Hello from proc compute-0-9.local, rank 6 of 16
Hello from proc compute-0-10.local, rank 15 of 16
Hi, I have the same problem with Intel MPI. I have attached a partial output of "strace -f" which appears to show that Intel's "pmi_proxy" creates a socketpair (line numbered 3051), but the fd is closed in the child process by Singularity's "action-suid" (line 4798). Hope this can help someone. Please tell me if you need more context from the strace output. I haven't tried using OpenMPI, since the application and container I am working with need to use Intel MPI.
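The failure mode described above (a launcher opens a descriptor, and it gets closed before the child writes to it) can be reproduced in miniature without MPI or Singularity. This is only an analogy: fd 9 and the function name pmi_write are stand-ins chosen for the demo, not the real PMI descriptor or any Singularity code.

```shell
#!/usr/bin/env bash
# pmi_write MODE: MODE is "keep" or "close".
# Runs in a subshell so the caller's fd 9 is left untouched.
pmi_write() {
  (
    if [ "$1" = "close" ]; then
      exec 9>&-          # simulate a startup cleanup that closes inherited fds
    fi
    # The braces send the shell's own "Bad file descriptor" message to /dev/null.
    if { echo "cmd=init" >&9; } 2>/dev/null; then
      echo "write ok"
    else
      echo "write failed"
    fi
  )
}

exec 9>/dev/null         # the "launcher" opens the channel before starting ranks
pmi_write keep           # child inherits fd 9 and can write to it
pmi_write close          # child has lost fd 9, so the write fails
exec 9>&-
```

The first call should report a successful write and the second a failed one, which is the same shape as the "write_line error ... Bad file descriptor" messages in the original report.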
Hi,
As pointed out by elasto, that seems to work for our images (for IMPI as well as OMPI) before this recent commit:
https://github.com/singularityware/singularity/commit/e80c9dd61932cfa1627400ccd1b5cb316556a1af
With version 2.5.1, I made some retries on 2 nodes sharing a hello image:
@bessonc The release-2.5 branch _should_ fix this then... that code was changed, and sockets should only be closed if you run in a network namespace.
See commit https://github.com/singularityware/singularity/commit/caca8628f6479b35c7aa409c228dbfabb0a8a75d
@jmstover Yes, it seems to work for me with Intel MPI with the release-2.5 branch. Thanks!
@hakonenger Good to hear that worked for you.
Anyone else who's having the issue still getting it when running from the release-2.5 branch?
Hi folks,
I tested release-2.5 with a couple of MPI benchmarks compiled and executed at run-time with Intel MPI and OpenMPI. Both work fine.
I will test this fix first thing tomorrow and update everyone.
Thank you
Hi folks,
I have similar issues.
For Intel MPI, starting mpirun within the container works, while starting mpirun on the host shows this error.
For OpenMPI, both work fine.
I tested with 2.5.0 and 2.5.1.
Only 2.4.6 avoids this problem.
Would a new release fix this problem?
Thanks
Hi @xuzheng97,
There should be a new release (2.5.2) within the next few days that contains the fixes from the release-2.5 branch that worked for others to get around the issue.
Hi, @jmstover
Thanks
I have another question. Could you advise?
How can I list running containers on the host?
And if I run 'mpirun -np 8 singularity exec xxx.exe', would 8 singularity containers run on the host, or only 1 singularity container with 8 processes?
Thanks again
It would be 8 singularity containers. If you wanted 1 container, with 8 processes then you would do something like:
singularity exec some.img mpirun -n 8 xxx.exe
So... you run mpirun from within the container, instead of using mpirun to execute singularity.
There isn't a good way to list running containers, as Singularity itself is out of the equation once it spawns the process. For example, this is what shows up as a singularity shell process:
jason 25531 0.0 0.0 23268 5492 pts/8 Ss May25 0:00 \_ -bash
jason 744 0.3 0.0 19408 4412 pts/8 S+ 02:43 0:00 | \_ /bin/bash --norc
The singularity shell is the /bin/bash --norc process (pid 744), which is seen as just a child of my terminal bash shell. If you run singularity exec ... the process you run is seen as a child of your terminal. Singularity just exec's the process after setting everything up.
$ singularity exec ./ubuntu-latest.img ps auxwwf | grep -B1 'ps a' | grep -v grep
jason 25531 0.0 0.0 23268 5492 pts/8 Ss May25 0:00 \_ -bash
jason 990 0.0 0.0 34712 3268 pts/8 R+ 02:52 0:00 | \_ ps auxwwf
Pid 990 is the ps command that's being run with singularity exec ... and it's seen just as a child of my terminal bash process.
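The "just exec's the process" behavior is ordinary exec semantics and can be seen with plain bash, no Singularity involved. A small sketch (the function name exec_keeps_pid is made up for this example): the outer shell prints its PID, then replaces itself with a second shell that prints its own PID.

```shell
#!/usr/bin/env bash
# Prints two lines: the PID before exec and the PID after exec.
# Because exec replaces the process image in place, both lines are the same
# number, and no separate "wrapper" process is left in the process tree.
exec_keeps_pid() {
  bash -c 'echo "$$"; exec bash -c "echo \$\$"'
}

exec_keeps_pid
```

This is why a process listing shows only the program that was started and nothing that could be identified as a running container.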
@jmstover
Thanks a lot for your explanation.
If I run mpirun within the container, it seems very difficult to run on multiple nodes. I googled and somebody said to use an ssh wrapper.....
Thanks again
Yes, if you start it from within the container doing multinode is going to be painful. When MPI ssh's to another node to start the process it's not within a container when it SSH's.... and that's where the wrapper comes in. It needs to start up a container with the right image and drop the SSH connection into it.
But, IMO, doing it that way works better if you have a MPI job that runs on a _single_ node. Having mpiexec calling singularity is better for multinode.
@jmstover: this explanation is very interesting. If "mpirun -nX singularity exec ..." spawns multiple singularity processes, does it mean MPI cannot use shm for intra-node communication between those processes? (Which would mean pmi_proxy does this job?)
Hi @bessonc,
No, /dev/shm can be used (depending on options). By default the host /dev is bind mounted, so that's available to you.
The reason I say I prefer running MPI from within the container over having mpiexec execute singularity on a single node is that there isn't a reason for multiple access hits to an image file... or for taking up multiple loop devices on a single node run. So, don't run multiple instances of singularity, but run the MPI processes from within the container. The behavior should be the same either way.
Hi @jmstover
Thank you for your reply. I am not referring to /dev/shm but to the shared memory used for inter-process communication on the same node (http://man7.org/linux/man-pages/man7/shm_overview.7.html), which MPI uses as its underlying transport (I_MPI_FABRICS=shm is the default for intra-node). In my use case, I run singularity over multiple nodes, with multiple processes per node (and multiple OMP threads, depending on the CPU arch). So I need "mpirun singularity exec ..." to do this. If multiple containers are spawned on a given host, I believe there will be one user namespace per process, so shm_*() calls won't work between those processes (??). I did not manage to identify any process related to Singularity itself, but I tried 2 things on a single node: mpirun and salloc+mpirun. In both cases, both processes seem to descend from the same PPID (mpiexec.hydra or slurmstepd). Many thanks in advance for any enlightenment :)
$ mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
1441 26907 26907 26907 ? -1 Ss 0 0:00 _ sshd: bessonc [priv]
26907 26909 26907 26907 ? -1 S 50133 0:00 | _ sshd: bessonc@pts/3
26909 26910 26910 26910 pts/3 27571 Ss 50133 0:00 | _ -bash
26910 27571 27571 26910 pts/3 27571 S+ 50133 0:00 | _ /bin/sh /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27571 27576 27571 26910 pts/3 27571 S+ 50133 0:00 | _ mpiexec.hydra -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27576 27577 27577 26910 pts/3 27571 S 50133 0:00 | _ /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:42238 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 2005308951 --usize -2 --proxy-id 0
27577 27581 27581 26910 pts/3 27571 Rl 50133 2:41 | _ xhpcg_avx --n 192 --t 10
27577 27582 27582 26910 pts/3 27571 Rl 50133 2:41 | _ xhpcg_avx --n 192 --t 10
$ salloc --reservation=sing --exclusive -N1 -n2 -w mo73 mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
25345 25346 25346 25346 pts/0 25346 Ss+ 16190 0:00 | _ -bash
1441 26907 26907 26907 ? -1 Ss 0 0:00 _ sshd: bessonc [priv]
26907 26909 26907 26907 ? -1 S 50133 0:00 | _ sshd: bessonc@pts/3
26909 26910 26910 26910 pts/3 27629 Ss 50133 0:00 | _ -bash
26910 27627 27627 26910 pts/3 27629 Sl 50133 0:00 | _ salloc --reservation=sing --exclusive -N1 -n2 -w mo73 mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27627 27629 27629 26910 pts/3 27629 S+ 50133 0:00 | _ /bin/sh /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpirun -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27629 27634 27629 26910 pts/3 27629 S+ 50133 0:00 | _ mpiexec.hydra -genv OMP_NUM_THREADS=8 -np 2 singularity exec /scratch/bessonc/xhpcg_avx_impi2018.simg xhpcg_avx --n 192 --t 10
27634 27635 27635 26910 pts/3 27629 Sl 50133 0:00 | _ /usr/bin/srun --nodelist mo73 -N 1 -n 1 --input none /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:45375 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 618750365 --usize -2 --proxy-id -1
27635 27636 27635 26910 pts/3 27629 S 50133 0:00 | _ /usr/bin/srun --nodelist mo73 -N 1 -n 1 --input none /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:45375 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 618750365 --usize -2 --proxy-id -1
1 27646 27645 27645 ? -1 Sl 0 0:00 slurmstepd: [1066.0]
27646 27652 27652 27645 ? -1 S 50133 0:00 _ /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:45375 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 618750365 --usize -2 --proxy-id -1
27652 27659 27659 27645 ? -1 Rl 50133 4:50 _ xhpcg_avx --n 192 --t 10
27652 27660 27660 27645 ? -1 Rl 50133 4:51 _ xhpcg_avx --n 192 --t 10
I believe there will be one user namespace / process
By default we are not spawning into a new user or PID namespace in Singularity 2.x. We are thinking of changing this behavior in 3.x though.
Currently you need to use --contain/--containall, or the -p option if you want a PID namespace. And the -u option for a user namespace.
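On Linux, whether two running processes really share a namespace can be checked by comparing the links under /proc/<pid>/ns. A minimal sketch, assuming you have permission to read both processes' entries; the function name same_ns is made up for this example.

```shell
#!/usr/bin/env bash
# same_ns PID1 PID2 KIND
# KIND is a name under /proc/<pid>/ns, e.g. pid, ipc, user, mnt, net.
# Returns 0 if both processes are in the same namespace, 1 if they differ,
# and 2 if either link cannot be read.
same_ns() {
  local a b
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

# Trivial self-check: a process always shares a namespace with itself.
same_ns "$$" "$$" pid && echo "same PID namespace"
```

Running it on the two rank PIDs from a process tree (for example, same_ns 27581 27582 ipc) would show whether options such as -p or -u have actually separated them.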
Ok, thank you again. When I use -p (for a PID namespace), the process tree shows 2 HPCG processes spawned by 2 different singularity action-suid binaries, and MPI complains...
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(201)...................: MPI_Send(buf=0x7fdbf40c2010, count=36864, MPI_INT, dest=0, tag=98, MPI_COMM_WORLD) failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(302): fail failed
dcp_recv(165)...................: Internal MPI error! Cannot read from remote process
Two workarounds have been identified for this issue:
1) Enable ptrace for non-root users with:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
2) Or, use:
I_MPI_SHM_LMT=shm
However, adding the last IMPI env suggested in the error message works, and I didn't notice any perf collapse at this step (!)
1 29274 29273 29273 ? -1 Sl 0 0:00 slurmstepd: [1078.0]
29274 29280 29280 29273 ? -1 S 50133 0:00 \_ /opt/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/pmi_proxy --control-port mo73.echi:40198 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1179907775 --usize -2 --proxy-id -1
29280 29287 29287 29273 ? -1 S 50133 0:00 \_ /usr/libexec/singularity/bin/action-suid xhpcg_avx --n 192 --t 10
29287 29293 29287 29273 ? -1 Rl 50133 9:19 | \_ xhpcg_avx --n 192 --t 10
29280 29288 29288 29273 ? -1 S 50133 0:00 \_ /usr/libexec/singularity/bin/action-suid xhpcg_avx --n 192 --t 10
29288 29291 29288 29273 ? -1 Rl 50133 9:18 \_ xhpcg_avx --n 192 --t 10
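The two workarounds quoted in the error message can be turned into a small pre-launch check. This is a sketch that relies only on what the message says: if ptrace_scope is not 0, fall back to I_MPI_SHM_LMT=shm. The function name pick_shm_lmt is made up, and the optional file argument exists only so the logic can be tried on a scratch file.

```shell
#!/usr/bin/env bash
# pick_shm_lmt [FILE]
# Prints "shm" when the ptrace_scope value in FILE is anything other than 0,
# and prints nothing when it is 0 or when FILE is not readable.
pick_shm_lmt() {
  local file=${1:-/proc/sys/kernel/yama/ptrace_scope}
  local scope
  [ -r "$file" ] || return 0     # no readable setting: leave the default alone
  scope=$(cat "$file")
  [ "$scope" = "0" ] || echo "shm"
}

lmt=$(pick_shm_lmt)
if [ -n "$lmt" ]; then
  export I_MPI_SHM_LMT="$lmt"    # second workaround from the error message
fi
```

This avoids changing the sysctl as root (the first workaround) and only sets the variable on hosts where the restriction is actually in place.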