I am currently seeing a crash during initialization in about 1 of ~10 simulations with current dev as of 0577bae174919ac5c0fbfec5b25f9c7dc0f0772d.
Looks like a memory violation.
It manifests e.g. in a free(): corrupted unsorted chunks in
picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::pluginLoad()
A multi-species ion simulation without ionization dynamics.
Hypnos (HZDR) with CUDA 8 backend on K80 (16 devices / sim).
# Core Dependencies
module load gcc/4.9.2
module load cmake/3.7.2
module load boost/1.62.0
module load cuda/8.0
module load openmpi/1.8.6.kepler.cuda80
# Plugins (optional)
module load pngwriter/0.7.0
module load hdf5-parallel/1.8.15 libsplash/1.6.0
cc @psychocoderHPC
On K80 or K20? If it is on K80 it could be a cluster issue. We currently have MPI issues on K80.
Might be. I only saw the issues on kepler024 and kepler025, as far as I remember.
@ax3l I added you on the ticket for hypnos
I had similar problems with the Radiation plugin enabled, but with a chance of ~ 1:2 of PIConGPU crashing right at the beginning. If it runs, it runs well without errors.
@theZiz I see the same random behavior. I started to check this with a default LWFA setup (without the radiation plugin) on Friday and the issue persisted. I added you to the ticket as well.
I have the impression that the chance of an initial crash or hanging increases with the number of nodes involved.
I can confirm this. While testing for the Evaluation (on a part of the cluster) everything worked fine most of the time, but using the whole cluster the sh*t hit the fan.
Issue persists today. Roughly at the same failure rate.
It looks like it is only an issue of the k80 queue.
Remember the strange steering issues I had during the last live visualization of PIConGPU? Those also only happened on the k80 queue.
<-- sad little fellow 😿
I'll leave this open for self-help and free hugs until the issue is fixed
:hugs: :hugs: :hugs: :hugs: :bear:
no problems after maintenance