PIConGPU: Plugin Load: Free Crash (Hypnos K80 Only?)

Created on 28 Jan 2018 · 13 Comments · Source: ComputationalRadiationPhysics/picongpu

With the current dev as of 0577bae174919ac5c0fbfec5b25f9c7dc0f0772d, I am seeing a crash during initialization in roughly 1 out of 10 simulations.

Looks like a memory violation.

It manifests, e.g., as free(): corrupted unsorted chunks in

picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::pluginLoad()

Setup

A multi-species ion simulation without ionization dynamics.

Environment

Hypnos (HZDR) with CUDA 8 backend on K80 (16 devices / sim).

        # Core Dependencies
        module load gcc/4.9.2
        module load cmake/3.7.2
        module load boost/1.62.0
        module load cuda/8.0
        module load openmpi/1.8.6.kepler.cuda80

        # Plugins (optional)
        module load pngwriter/0.7.0
        module load hdf5-parallel/1.8.15 libsplash/1.6.0

cc @psychocoderHPC

Labels: cuda, machine/system

All 13 comments

On K80 or K20? If it is on K80, it could be a cluster issue. We currently have MPI issues on K80.

Might be. As far as I remember, I only saw the issue on kepler024 and kepler025.

@ax3l I added you on the ticket for hypnos

I had similar problems with the Radiation plugin enabled, but with a chance of ~ 1:2 of PIConGPU crashing right at the beginning. If it runs, it runs well without errors.

@theZiz I see the same random behavior. I started to check this with a default LWFA setup (without the radiation plugin) on Friday and the issue persisted. I added you to the ticket as well.
I have the impression that the chance of an initial crash or hang increases with the number of nodes involved.

I can confirm this. While testing for the Evaluation (on a part of the cluster) everything worked fine most of the time, but using the whole cluster the sh*t hit the fan.
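
The observation that crashes become more likely as more nodes are involved is what one would expect if each node had a small, independent chance of failing during initialization. The following is only a minimal sketch of that reasoning, not a diagnosis of the Hypnos issue: the independence assumption, the function names, and the example numbers are all hypothetical and not taken from this thread.

```python
def job_crash_probability(p_node: float, n_nodes: int) -> float:
    """Chance that at least one of n_nodes fails at startup.

    ASSUMPTION: every node fails independently with the same
    probability p_node. Real cluster faults may be correlated.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be within [0, 1]")
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    # The job survives only if every node survives.
    return 1.0 - (1.0 - p_node) ** n_nodes


def per_node_probability(p_job: float, n_nodes: int) -> float:
    """Invert the model: per-node failure chance implied by an
    observed job failure rate p_job on n_nodes nodes."""
    if not 0.0 <= p_job <= 1.0:
        raise ValueError("p_job must be within [0, 1]")
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    return 1.0 - (1.0 - p_job) ** (1.0 / n_nodes)


if __name__ == "__main__":
    # Hypothetical numbers: a 5 % per-node failure chance.
    for n in (1, 2, 4, 8, 16):
        print(n, round(job_crash_probability(0.05, n), 3))
```

Under this model the job failure rate grows monotonically with the node count, which would match "fine on a part of the cluster, bad on the whole cluster". It cannot distinguish a few bad nodes from a uniformly flaky interconnect; comparing failure rates per host list would be needed for that.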

Issue persists today. Roughly at the same failure rate.

It looks like it is only an issue of the k80 queue.

Remember the strange steering issues I had during the last live visualization of PIConGPU? Those also only happened on the k80 queue.

<-- sad little fellow 😿

I'll leave this open for self-help and free hugs until the issue is fixed

:hugs: :hugs: :hugs: :hugs: :bear:

No problems after maintenance.
