Picongpu: runtime error on topic-collisions "illegal memory access"

Created on 13 Oct 2020 · 26 Comments · Source: ComputationalRadiationPhysics/picongpu

Got a runtime error:

Unhandled exception of type 'St13runtime_error' with message '/home/paschk31/src/picongpu/thirdParty/cupla/alpaka/include/alpaka/event/EventUniformCudaHipRt.hpp(198) 'ret = ALPAKA_API_PREFIX(EventQuery)( event.m_spEventImpl->m_UniformCudaHipEvent)' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!', terminating

Occurred while on the branch (@pordyna):
https://github.com/pordyna/picongpu/tree/topic-collisions

First I ran a simulation without Collision.param, which went smoothly.
Then I ran a simulation with Collision.param, which gave the above error (param and cfg are otherwise identical to before).

already discussed with @n01r

All 26 comments

Thanks for reporting, @paschk31. Could you provide your exact setup: which commit was used, and which .param and .cfg files? We can enable some debug checks, and that would hopefully give more info on what went wrong.

Edit: also, on which system did it happen? I assume one of the Hemera partitions?

@sbastrakov yes, as you guessed, it happened on hemera5
commit: f2c450e1d5e934d065cb475a9fdceb2d1819893d
input set: input_collision.zip

actually the commit was: 491daef0e7f36894e1ade8cbd3d9cb8e7d633ab0

Okay, starting to look right now. What was the Hemera partition?

It was gpu. We tried running it on k80 with a smaller box and fewer macro particles, but it didn't help. I just tried LWFA on k20 with ions and collisions on, and got the same error. Now I'm trying LWFA again with a much smaller particle box.

This is the output with PMACC_BLOCKING_KERNEL=ON:
stdout.txt
stderr.txt

Looks like alpaka is catching the error before we create a nice error message in PMacc.

Yes, it looks the same to me. Generally, I think we should also make PIConGPU error handling more hierarchical, e.g. to know which stage failed and not just that something failed.
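One way to picture the hierarchical error handling suggested above is with standard C++ nested exceptions. This is only a minimal sketch, not PIConGPU or PMacc code: the names `runStage`, `describe`, and `demoMessage`, as well as the stage name "collisions", are hypothetical and chosen for illustration.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper: run one named simulation stage and, if it throws,
// rethrow with the stage name attached as an outer (nested) exception.
template<typename Stage>
void runStage(std::string const& stageName, Stage&& stage)
{
    try
    {
        stage();
    }
    catch(...)
    {
        std::throw_with_nested(std::runtime_error("stage '" + stageName + "' failed"));
    }
}

// Flatten a chain of nested exceptions into "outer: inner".
std::string describe(std::exception const& e)
{
    std::string text = e.what();
    try
    {
        std::rethrow_if_nested(e);
    }
    catch(std::exception const& inner)
    {
        text += ": " + describe(inner);
    }
    catch(...)
    {
        text += ": <non-standard exception>";
    }
    return text;
}

// Example: build the flattened message for a failing, hypothetical stage.
std::string demoMessage()
{
    try
    {
        runStage("collisions", [] { throw std::runtime_error("illegal memory access"); });
    }
    catch(std::exception const& e)
    {
        return describe(e);
    }
    return "no error";
}
```

With this pattern the low-level message is preserved, and each layer that catches and rethrows adds the context it knows about.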

If this is the case, it may be actually harmful for debugging? And so alpaka may need a similar mechanism or option to throw vs. print and throw? @psychocoderHPC

Yes we need to catch the error and provide our message.

Sorry it took so long, but with #3396 we should have more informative error messages when a crash happens inside a kernel. This works only when the option PMACC_BLOCKING_KERNEL is enabled (which it should be for debugging), e.g. like pic-build -c "-DPMACC_BLOCKING_KERNEL=ON". I think you could already take these changes and put them on the branch in question; it's all in one file.
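Why a blocking mode helps attribute the error can be illustrated with a toy model. The sketch below is not the alpaka, cupla, or PMacc API: `ToyQueue` and `runStep` are hypothetical names, and the "kernels" are plain host callables. It only mimics the behaviour that an asynchronous failure becomes visible at the next synchronization point, which may be far from the kernel that caused it.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using NamedKernel = std::pair<std::string, std::function<bool()>>;

// Toy stand-in for an asynchronous device queue (assumption, not a real API).
class ToyQueue
{
public:
    // Enqueue a named kernel; the callable returns true on success.
    void launch(std::string name, std::function<bool()> kernel)
    {
        m_pending.emplace_back(std::move(name), std::move(kernel));
    }

    // Run everything enqueued so far. Like a device-wide sync, it only
    // reports that an error happened, not which kernel caused it.
    bool synchronize()
    {
        bool ok = true;
        for(auto& entry : m_pending)
            ok = entry.second() && ok;
        m_pending.clear();
        return ok;
    }

private:
    std::vector<NamedKernel> m_pending;
};

// Launch all kernels. In blocking mode, sync after each launch so that a
// failure can be attributed to the kernel that was just launched.
// Returns the name blamed for the failure, or std::nullopt on success.
std::optional<std::string> runStep(std::vector<NamedKernel> const& kernels, bool blocking)
{
    ToyQueue queue;
    for(auto const& [name, kernel] : kernels)
    {
        queue.launch(name, kernel);
        if(blocking && !queue.synchronize())
            return name;
    }
    if(!queue.synchronize())
        return std::string("<unknown kernel, detected at final sync>");
    return std::nullopt;
}
```

In the non-blocking case the toy can only say that something failed before the final sync, which corresponds to the uninformative event-query error in the original report.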

With the LWFA example, ions on and ionisation disabled.

  • ./bin/picongpu -d 1 1 1 -p 0 0 1 -g 192 1024 36 -s 100 crashes at 78%
  • ./bin/picongpu -d 1 1 1 -p 0 0 1 -g 96 1024 36 -s 100 runs
  • mpi-exec -n 2 ./bin/picongpu -d 2 1 1 -p 0 0 1 -g 192 1024 36 -s 100 crashes at 84%
  • mpi-exec -n 4 ./bin/picongpu -d 4 1 1 -p 0 0 1 -g 192 1024 36 -s 100 crashes at 65%

So it doesn't really look like it is just not enough memory. With "-DPMACC_BLOCKING_KERNEL=ON" it says that it crashes at the collision kernel.

I tried running it with cuda-memcheck but I didn't learn much from it. Maybe I need to set some extra compile flags? @sbastrakov @psychocoderHPC what do you think?
memcheck.log

@pordyna okay, so I believe the output should now be trustworthy. I believe as a super-naive way of narrowing down the issue, one could gradually disable or comment out parts of that kernel? E.g. with a dichotomy approach: remove the second "half", then the second half of the failing half, etc. This is of course a bit repetitive, and maybe you have a better guess already. Just what I would do without knowing the details of your implementation.
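The dichotomy approach described above is a plain bisection over the ordered parts of the kernel. A minimal sketch, assuming the behaviour is monotonic (once the faulty part is enabled, every longer prefix also crashes) and that the full kernel crashes; `firstFailingPrefix` and the callback `crashesWithFirst` are hypothetical names, the latter standing in for "rebuild with only the first n parts enabled and run the test case":

```cpp
#include <cstddef>
#include <functional>

// Find the smallest n such that running only the first n parts still crashes.
// Preconditions (assumptions): total >= 1, crashesWithFirst(0) would be false,
// crashesWithFirst(total) is true, and the predicate is monotonic in n.
std::size_t firstFailingPrefix(
    std::size_t total,
    std::function<bool(std::size_t)> const& crashesWithFirst)
{
    std::size_t lo = 0;     // known good: nothing enabled
    std::size_t hi = total; // known bad: everything enabled
    while(hi - lo > 1)
    {
        std::size_t const mid = lo + (hi - lo) / 2;
        if(crashesWithFirst(mid))
            hi = mid; // fault is within the first mid parts
        else
            lo = mid; // first mid parts are fine
    }
    return hi; // 1-based position of the first part that triggers the crash
}
```

For a kernel split into 8 parts this needs 3 rebuild-and-run cycles instead of up to 8.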

@sbastrakov @HighIander so I was able to narrow it down a little bit. I tried to run my thermalization simulation with more cells and it still runs. Initially field solver, current and pushers were disabled. So I enabled all 3 and it crashed. With the Yee solver enabled, species needed to have both a current and a pusher attribute to trigger a crash.
I tried disabling the current and the pusher in my previous LWFA setup, and the simulation ran for 1000 steps with no error! The only difference to my thermalization case: for LWFA, disabling/enabling the current doesn't change anything. Error occurrence depends solely on the pusher.

I'm quite confused with this. Do you have any ideas? The error happens in the IntraCollisions kernel (same species) so the next thing for me would be to check if anything happens when there are only inter-species collisions on.

Just as a thought while reading your message. In case the particle pusher is used, there is also fillAllGaps called. Could it be that there is some problem in collisions that does not take into account that the last frame is not necessarily full? (Or it just does not call fillAllGaps afterwards)?

Take care: https://github.com/pordyna/picongpu/blob/921902f80230370517bc9769f0eb8f1325f3029f/include/picongpu/particles/collision/IntraSpecies.hpp#L226 is only active if you do not compile in release mode.

@psychocoderHPC I have removed these asserts already, since they won't work on the device side anyway. I should put an if statement there for now.

Is this part of the code executed if there is only one particle in the cell?

@sbastrakov I do something like this:

auto & superCell = pb.getSuperCell( superCellIdx );
uint32_t numParticlesInSupercell = superCell.getNumParticles();

and check if frameId * frameSize + index < numParticlesInSupercell

I'm not calling `fillAllGaps` in my kernel. Should I? What does it do exactly?
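The index check quoted above can be written as a small predicate. This is a sketch under the assumption that all frames before the last one are completely filled, so that the linear index is meaningful; `holdsParticle` is a hypothetical name, not a PIConGPU function.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the quoted check: a slot in a frame holds a
// real particle only if its linear index is below the particle count of the
// supercell. Assumes every frame before the last one is full.
constexpr bool holdsParticle(
    std::uint32_t frameId,
    std::uint32_t frameSize,
    std::uint32_t index,
    std::uint32_t numParticlesInSupercell)
{
    // 64-bit arithmetic avoids overflow in frameId * frameSize.
    return std::uint64_t{frameId} * frameSize + index < numParticlesInSupercell;
}
```

For example, with a frame size of 256 and 300 particles in the supercell, slot 43 of the second frame (linear index 299) is valid, while slot 44 (linear index 300) is not.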

@psychocoderHPC I was just thinking about it. I think yes. This could explain it. It definitely shouldn't be.

@pordyna then I think you are fine there. fillAllGaps is called from the host side after a kernel that can delete particles or move them between supercells (the latter happens after the particle pusher and moveAndMark). But thinking about it now, you probably don't do either in your operations?

@sbastrakov @psychocoderHPC It looks like this was it. It runs. I was either colliding with just one particle (in intra-collisions), or with zero particles in the shorter list (inter-collisions). Before, I was only checking that the longer list is not empty.
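The two degenerate cases described above can be expressed as guards evaluated before any pairing is attempted. A minimal sketch, not the code on the branch: `canCollideIntra` and `canCollideInter` are hypothetical names, and the particle counts are assumed to be per cell.

```cpp
#include <algorithm>
#include <cstdint>

// Intra-species: particles of one cell are paired with each other,
// so at least two particles are needed.
constexpr bool canCollideIntra(std::uint32_t numParticlesInCell)
{
    return numParticlesInCell >= 2u;
}

// Inter-species: every pair takes one particle from each list, so the
// shorter list must not be empty. Checking only the longer list is not enough.
constexpr bool canCollideInter(std::uint32_t numInCellA, std::uint32_t numInCellB)
{
    return std::min(numInCellA, numInCellB) >= 1u;
}
```

Skipping the cell when a guard returns false avoids indexing into an empty or single-element particle list, which is one plausible source of the illegal memory access reported here.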

@pordyna You can close the issue if it is solved.

I think only @paschk31 can close it.

Technically we can also click the button as maintainers. We just don't know if the issue is resolved.
