I have been running simulations on the ml partition of Taurus at TU Dresden. These are Power9 nodes equipped with V100 GPUs.
I have now hit the same error three times when trying to run a specific simulation:
terminate called after throwing an instance of 'std::runtime_error'
what(): /<Path to source>/picongpu/thirdParty/alpaka/include/alpaka/stream/StreamCudaRtAsync.hpp(351) 'cudaStreamSynchronize( stream.m_spStreamImpl->m_CudaStream)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
It appears after varying run times, for instance once after 15k steps and once after 40k steps. Oddly, whenever it happens I cannot restart the simulation from the last (ADIOS) checkpoint because the checkpoint is broken, even though it was written 5000 time steps before the simulation crashed. The corresponding error message is
ERROR: Variable '/data/10000/particles/e/position/x' is not found!
ERROR: Variable '/data/10000/particles/e/position/x' is not found!
Note that I could restart the simulation two days ago when it ended successfully at a given time step, and I had simulations running after the first time this error appeared. Additionally, I use the whole cluster, at least 4 out of 6 GPUs per node, due to too little host memory. Therefore I do not really think this is a GPU-specific problem.
I am clueless at the moment. Can anyone help?
As I understand it, this message arises when a kernel crashes for any of a wide variety of reasons: from an illegal memory access (e.g. due to an indexing bug, an uninitialized variable, or a data race) to hardware issues or being killed off by the operating system or driver. So unfortunately it does not tell us much.
The checkpoints also being broken is an interesting observation. Theoretically, an explanation could be that the checkpointing implementation is somehow fragile with respect to crashes (e.g. something is not flushed, and if a crash occurs the file is left in an invalid state). However, I did not see any confirmation in the code that this actually happens.
Looks to me like your I/O fails due to a full filesystem (we checked: the OSTs are at 98%).
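For reference, per-OST usage on a Lustre filesystem can be checked with something like the following (a sketch, assuming the standard Lustre client tools are available and using a hypothetical /scratch mount point):
```
lfs df -h /scratch   # shows usage per OST; replace /scratch with your workspace path
```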
Try adding -DPMACC_BLOCKING_KERNEL=ON to trigger the error immediately in the kernel that fails:
https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging#add-debug-flags-to-the-code
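For example (a sketch, assuming you configure the build with CMake directly; the paths are placeholders, and if you drive the build through a wrapper such as pic-build the flag goes into its CMake options instead):
```
# re-run CMake with the blocking-kernel flag, then rebuild as usual
cmake -DPMACC_BLOCKING_KERNEL=ON <other cmake options> <path to build/source>
```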
By default, you will see such errors around checkpoints, since we synchronize all open events during a checkpoint to make sure that a written and closed checkpoint is indeed in a restartable state. At this point, CUDA errors are checked.
Well, it has now become pretty obvious that this is due to a full file system. Whenever I deleted data, the simulations ran again. I will try the blocking-kernel debug flag when I experience the error again. Thanks!
ADDITION
The error message
ERROR: Variable '/data/10000/particles/e/position/x' is not found!
was caused by a different fault. I created the ADIOS .bp meta files using the command
bpmeta -z checkpoint_<step>.bp
where the -z option removes empty data sets (which I did not know :disappointed:). Since the particle species e belonged to a density distribution that was not in the simulation window, its data set was not mentioned in the meta file.
Creating the meta file by
bpmeta checkpoint_<step>.bp
allowed me to restart from the checkpoint.
However, I was told that creating the meta file with the -z option had been necessary earlier in order to read simData files.
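For future reference, a quick way to verify which variables actually ended up in a checkpoint might be the ADIOS bpls utility (an assumption that it is installed alongside bpmeta; the grep pattern is just the variable from the error message above):
```
bpmeta checkpoint_<step>.bp                              # regenerate the meta file without -z
bpls checkpoint_<step>.bp | grep 'particles/e/position'  # confirm the variable is listed
```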