The following bug was discovered by @Anton-Le:
The radiation plugin writes HDF5 data both for checkpointing and as an advanced output format that still contains the phase of the emitted radiation. However, if --<species>_radiation.totalRadiation is not set, this output is not generated: even the directory for the regular HDF5 output is not created during startup. With --<species>_radiation.totalRadiation enabled, everything works fine.
Thanks @Anton-Le for finding this bug.
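For illustration, here is a minimal C++ sketch of the guard-bug pattern this behavior suggests. All names in it (RadiationPluginSketch, pluginLoad, writeAmplitudes, outputDir) are hypothetical and not taken from the PIConGPU sources; the point is only that directory creation is tied to the totalRadiation flag while the amplitude dump is not.

```cpp
// Hypothetical sketch of the suspected bug pattern, NOT the actual
// PIConGPU radiation plugin code.
#include <cstdint>
#include <string>
#include <sys/stat.h>

struct RadiationPluginSketch
{
    bool totalRadiation = false;              // --<species>_radiation.totalRadiation
    std::string outputDir = "radiationHDF5";

    void pluginLoad()
    {
        // BUG: the output directory is only created when the flag is set ...
        if (totalRadiation)
            mkdir(outputDir.c_str(), 0755);
    }

    void writeAmplitudes(uint32_t step)
    {
        // ... but the HDF5 amplitude dump runs unconditionally, so without
        // the flag the file create fails, e.g. with
        // "Failed to create file (radiationHDF5/e_radAmplitudes_..._0_0_0.h5)".
        std::string file = outputDir + "/e_radAmplitudes_"
            + std::to_string(step) + "_0_0_0.h5";
        // hdf5Open(file);  // hypothetical call standing in for libSplash
    }
};
```

A fix along these lines would create the directory unconditionally during plugin startup, since the amplitude output does not depend on the flag.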
Interestingly, for @Anton-Le the simulation just hung and did not crash.
I tested it on the k20 partition of hemera at HZDR and there it crashed with the following error message:
[1,0]<stderr>:[SPLASH_LOG:0] Exception: Exception for HandleMgr::get: Failed to create file (radiationHDF5/e_radAmplitudes_2800_0_0_0.h5)
[1,0]<stderr>:Unhandled exception of type 'N6splash11DCExceptionE' with message 'Exception for HandleMgr::get: Failed to create file (radiationHDF5/e_radAmplitudes_2800_0_0_0.h5)', terminating
[1,0]<stderr>:[kepler003:28937] *** Process received signal ***
[1,0]<stderr>:[kepler003:28937] Signal: Segmentation fault (11)
[1,0]<stderr>:[kepler003:28937] Signal code: (128)
[1,0]<stderr>:[kepler003:28937] Failing at address: (nil)
Update:
I just spoke offline with @Anton-Le: with MVAPICH as the MPI implementation, the simulation did not crash but simply stopped working. With OpenMPI it did crash after encountering the error. (According to @Anton-Le, MVAPICH should communicate with the SLURM scheduler when it encounters an error; this communication seems to fail and causes the hang.)
ping @PrometheusPi: can this be closed now that #3021 is merged?
@ax3l Thanks for the reminder.
Since some resources were free on the system I could run a comparison on, I tried to reproduce the problem with MVAPICH, running on CPUs only (the machine has no GPUs).
There the program crashed too, and SLURM terminated the job correctly with the following errors:
terminate called after throwing an instance of 'splash::DCException'
what(): Exception for HandleMgr::get: Failed to create file (radiationHDF5/b_radAmplitudes_12500_0_0_0.h5)
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1283).....................: MPI_Reduce(sbuf=0x2748020, rbuf=0x274c2d0, count=1026, MPI_DOUBLE, MPI_SUM, root=0, comm=0xc400000a) failed
MPIR_Reduce_impl(1085)................:
MPIR_Reduce_intra(883)................:
MPIR_Reduce_redscat_gather(484).......:
MPIDU_Complete_posted_with_error(1152): Process failed
MPIR_Reduce_redscat_gather(620).......:
MPIC_Send(371)........................:
MPID_Send(95).........................: Communication error with rank 0
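The stack trace makes the failure mode clear: one rank terminates on the uncaught splash::DCException before entering the collective, and the MPI_Reduce on the surviving ranks then either aborts with a communication error or blocks forever waiting for the dead rank, depending on the MPI implementation and launcher. A minimal, self-contained C++/MPI sketch of that pattern (illustrative only, not PIConGPU code):

```cpp
// One rank dies with an uncaught exception before a collective; the
// remaining ranks either hang in MPI_Reduce or abort with
// "Communication error with rank 0", depending on the MPI stack.
#include <mpi.h>
#include <stdexcept>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Simulate the failed HDF5 file creation: the uncaught exception calls
    // std::terminate() and kills this rank, as splash::DCException did.
    if (rank == 0)
        throw std::runtime_error("Failed to create file");

    // The surviving ranks wait here for rank 0 to join the reduction.
    std::vector<double> sendBuf(1026, 1.0), recvBuf(1026, 0.0);
    MPI_Reduce(sendBuf.data(), recvBuf.data(), static_cast<int>(sendBuf.size()),
               MPI_DOUBLE, MPI_SUM, /*root=*/0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```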
So the hanging part is probably due to SLURM. I'm thinking of filing it as a bug with the IT department, pending further tests.
@Anton-Le Thanks for the update and for having a look into this SLURM/MPI issue.