PIConGPU: 'unspecified launch failure' during simulation causes 'std::runtime_error'

Created on 2 May 2019 · 4 Comments · Source: ComputationalRadiationPhysics/picongpu

I am running simulations on the ml partition of taurus at TU Dresden. These are Power9 nodes equipped with V100 GPUs.

I have now hit the same error three times when trying to run a specific simulation:

terminate called after throwing an instance of 'std::runtime_error'
  what():  /<Path to source>/picongpu/thirdParty/alpaka/include/alpaka/stream/StreamCudaRtAsync.hpp(351) 'cudaStreamSynchronize( stream.m_spStreamImpl->m_CudaStream)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!

It appears after different run times, for instance once after 15k steps and another time after 40k steps. Oddly, whenever it happens I cannot restart the simulation from the last (ADIOS) checkpoint because the checkpoint is broken, even though it was written 5000 time steps before the simulation crashed. The corresponding error message is

ERROR: Variable '/data/10000/particles/e/position/x' is not found!
ERROR: Variable '/data/10000/particles/e/position/x' is not found!

Note that two days ago I could restart the simulation after it had ended successfully at a given time step, and I have had simulations running since the first time this error appeared. Additionally, I use the whole cluster, or at least 4 out of 6 GPUs per node, because host memory is too limited. Therefore I do not really think that this is a GPU-specific problem.

I am clueless at the moment. Can anyone help?

Labels: plugin, machine/system, question

All 4 comments

As I understand it, this message appears when a kernel crashes, which can happen for a wide variety of reasons: from an illegal memory access (e.g. due to an indexing bug, an uninitialized variable, or a data race) to hardware issues or the kernel being killed by the operating system or driver. So unfortunately the message by itself does not tell us much.

The checkpoints also being broken is an interesting observation. In theory, one explanation could be that the checkpointing implementation is fragile with respect to crashes (e.g. something is not flushed, and if a crash occurs the file is left in an invalid state). However, I did not find anything in the code confirming that this actually happens.

It looks to me as if your I/O fails due to a full file system (we checked: the OSTs are at 98%).

Try adding -DPMACC_BLOCKING_KERNEL=ON to trigger the error immediately with the kernel that fails:
https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging#add-debug-flags-to-the-code

By default, you will see such errors around checkpoints since we synchronize all open events during a checkpoint to make sure a written and closed checkpoint is indeed in restartable state. At this point, CUDA errors are checked.

Well, it has now become pretty obvious that this is due to a full file system. Whenever I deleted data, simulations ran again. I will try the blocking kernel debug flag if I experience the error again. Thanks!
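Since a nearly full file system turned out to be the trigger here, a small pre-flight check before submitting a long run might help. The following is only a sketch under assumptions: the function name check_free_space and the 90% default limit are made up for illustration and are not part of PIConGPU. It relies on POSIX `df -P`, which reports the aggregate usage of the file system holding the given directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of PIConGPU): succeed only if the file
# system that holds DIR is less than MAX_PCT percent full.
# Usage: check_free_space DIR [MAX_PCT]   (MAX_PCT defaults to 90)
check_free_space() {
    dir=$1
    max_pct=${2:-90}

    # `df -P` prints one header line and one data line in POSIX format;
    # column 5 of the data line is the capacity, e.g. "42%".
    # Assumes the device name in column 1 contains no spaces.
    used=$(df -P "$dir" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

    # Bail out if the usage could not be parsed as a plain number.
    case $used in
        ''|*[!0-9]*)
            echo "could not determine usage of $dir" >&2
            return 2
            ;;
    esac

    if [ "$used" -ge "$max_pct" ]; then
        echo "WARNING: $dir is ${used}% full (limit ${max_pct}%)" >&2
        return 1
    fi
    return 0
}
```

It could be called as `check_free_space /path/to/simOutput 90 || exit 1` at the top of a submit script, where the path is a placeholder. On Lustre, the aggregate number can hide individual full OSTs, so looking at the per-OST output of `lfs df -h` as well is probably worthwhile.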

ADDITION

The error message

ERROR: Variable '/data/10000/particles/e/position/x' is not found!

was caused by a different fault. I created the ADIOS .bp meta files using the command

bpmeta -z checkpoint_<step>.bp

where the -z option removes empty data sets (which I did not know :disappointed:). Since the particle species e belonged to a density distribution that was not inside the simulation window, its data set was not listed in the meta file.
Creating the meta file with

bpmeta checkpoint_<step>.bp

allowed me to restart from the checkpoint.

However, I was told that, for earlier simData files, creating the meta file with the -z option was necessary in order to read them.
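To check whether a regenerated meta file still lists a variable that the restart needs, the listing printed by the ADIOS tool bpls can be searched for its path. This is a minimal sketch: listing_has_variable is a hypothetical helper name, and the step number 10000 in the usage comment is simply taken from the error message above.

```shell
#!/bin/sh
# Hypothetical helper: read a variable listing on stdin (for example the
# output of `bpls <file>`) and succeed only if the given variable path
# occurs in it. The match is a fixed-string substring match.
listing_has_variable() {
    grep -qF -- "$1"
}

# Usage sketch (not executed here; assumes bpls from ADIOS 1.x is on PATH):
#   bpls checkpoint_10000.bp \
#     | listing_has_variable '/data/10000/particles/e/position/x' \
#     || echo "variable not listed: was the meta file created with bpmeta -z?" >&2
```

Because the match is a plain substring search, a longer path that merely starts with the same string would also count as a hit; for a quick sanity check before a restart that is probably acceptable.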
