PIConGPU: Plugin Load: Free Crash (Hypnos K80 Only?)

Created on 28 Jan 2018 · 13 Comments · Source: ComputationalRadiationPhysics/picongpu

With current dev as of 0577bae174919ac5c0fbfec5b25f9c7dc0f0772d, I am seeing a crash during initialization in roughly 1 out of 10 simulations.

Looks like a memory violation.

It manifests, e.g., as a `free(): corrupted unsorted chunks` error in

picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::pluginLoad()

Setup

A multi-species ion simulation without ionization dynamics.

Environment

Hypnos (HZDR) with CUDA 8 backend on K80 (16 devices / sim).

        # Core Dependencies
        module load gcc/4.9.2
        module load cmake/3.7.2
        module load boost/1.62.0
        module load cuda/8.0
        module load openmpi/1.8.6.kepler.cuda80

        # Plugins (optional)
        module load pngwriter/0.7.0
        module load hdf5-parallel/1.8.15 libsplash/1.6.0

cc @psychocoderHPC

cuda, machine/system

Source

ax3l

All 13 comments

On K80 or K20? If it is on K80, it could be a cluster issue. We currently have MPI issues on K80.

psychocoderHPC on 28 Jan 2018

👍1

Might be. As far as I remember, I only saw the issues on kepler024 and kepler025.

ax3l on 29 Jan 2018

@ax3l I added you to the ticket for Hypnos.

PrometheusPi on 29 Jan 2018

👍1

I had similar problems with the Radiation plugin enabled, but with a roughly 1:2 chance of PIConGPU crashing right at the beginning. If it runs, it runs well without errors.

theZiz on 29 Jan 2018

👍1

@theZiz I see the same random behavior. On Friday I started checking this with a default LWFA setup (without the radiation plugin), and the issue persisted. I added you to the ticket as well.
I have the impression that the chance of an initial crash or hang increases with the number of nodes involved.

PrometheusPi on 29 Jan 2018

I can confirm this. While testing for the Evaluation (on a part of the cluster), everything worked fine most of the time, but when using the whole cluster the sh*t hit the fan.

theZiz on 29 Jan 2018

The issue persists today, at roughly the same failure rate.

ax3l on 30 Jan 2018

It looks like it is only an issue of the k80 queue.

psychocoderHPC on 30 Jan 2018

Remember the strange steering issues I had during the last live visualization of PIConGPU? Those also only happened on the k80 queue.

theZiz on 30 Jan 2018

<-- sad little fellow 😿

ax3l on 30 Jan 2018

I'll leave this open for self-help and free hugs until the issue is fixed.

ax3l on 30 Jan 2018

:hugs: :hugs: :hugs: :hugs: :bear:

psychocoderHPC on 30 Jan 2018

❤1

No problems after maintenance.

ax3l on 14 Feb 2018
