PIConGPU: LWFA CPU sim hangs on HZDR laser queue

Created on 27 Jul 2018 · 10 Comments · Source: ComputationalRadiationPhysics/picongpu

After I fixed the compile error in the default laser wakefield example (#2652), I submitted an 8.cfg job using the laser.tpl.

The job did not crash outright but hangs, producing the following output:

 Data for JOB [42881,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: laser024        Num slots: 64   Max slots: 0    Num procs: 8
        Process OMPI jobid: [42881,1] App: 0 Process rank: 0
        Process OMPI jobid: [42881,1] App: 0 Process rank: 1
        Process OMPI jobid: [42881,1] App: 0 Process rank: 2
        Process OMPI jobid: [42881,1] App: 0 Process rank: 3
        Process OMPI jobid: [42881,1] App: 0 Process rank: 4
        Process OMPI jobid: [42881,1] App: 0 Process rank: 5
        Process OMPI jobid: [42881,1] App: 0 Process rank: 6
        Process OMPI jobid: [42881,1] App: 0 Process rank: 7

 =============================================================
[1,0]<stdout>:PIConGPU: 0.4.0-dev
[1,0]<stdout>:  Build-Type: Release
[1,0]<stdout>:
[1,0]<stdout>:Third party:
[1,0]<stdout>:  OS:         Linux-4.4.0-38-generic
[1,0]<stdout>:  arch:       x86_64
[1,0]<stdout>:  CXX:        GNU (4.9.2)
[1,0]<stdout>:  CMake:      3.10.1
[1,0]<stdout>:  Boost:      1.62.0
[1,0]<stdout>:  MPI:        
[1,0]<stdout>:    standard: 3.0
[1,0]<stdout>:    flavor:   OpenMPI (1.8.6)
[1,0]<stdout>:  PNGwriter:  0.7.0
[1,0]<stdout>:  libSplash:  1.7.0 (Format 4.0)
[1,0]<stdout>:  ADIOS:      NOTFOUND
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3AlpakaRand seed: 42

It also created a backtrace file picongpu.80s-22833,laser024.btr with the following content:

picongpu:22833 terminated with signal 11 at PC=7fe353ee2ee9 SP=7fe344099df8.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(+0x97ee9)[0x7fe353ee2ee9]
/bigdata/hplsim/scratch/.../_LWFA_cpu/runs/001_default_LWFA/input/bin/picongpu[0x5915cc]
/bigdata/hplsim/scratch/.../_LWFA_cpu/runs/001_default_LWFA/input/bin/picongpu[0x467081]
/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0xc1ed0)[0x7fe354704ed0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fe356fe3184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe353f4537d]

@psychocoderHPC @sbastrakov @ax3l Do you have any idea what causes the error?

Labels: omp2b, bug, examples, third party

Most helpful comment

From my tests it looks like it is the random number generator. I will check the cpu state size or if we pass wrong sizes to the RNG.

All 10 comments

The backtrace is not showing any useful information. I will try to reproduce it next week.


I can reproduce this with current dev, even with gcc 5.3.0:

  • 8.cfg: crashes as described
  • 1.cfg: works well
  • 2.cfg: (derived from 1.cfg with 1x2x1, no -m) works well
  • 4.cfg: (derived from 1.cfg with 1x4x1, no -m) works well
  • 8.cfg: (derived from 1.cfg with 2x2x2, no -m) works well
  • 2.cfg: (derived from 1.cfg with 1x2x1, with -m) works well

The problem with the original 8.cfg is that it runs out of RAM.
Somehow the setup ("8 CPU devices" = 8 processes with a total of 64 threads) exhausts the 256 GByte of RAM with its 192x1024x192 cells and 9,437,184 macro-particles per device.

@ax3l I'm wondering how running out of RAM is possible here. According to this page, the laser024 node, which hosts all 8 processes in the original log, has 256 GB of memory, and the grid size in 8.cfg is only 192x1024x192. Am I missing something, or is the memory perhaps preallocated too aggressively? Also, simply running out of memory (on CPU, at least via `new`) should be easy to report properly; maybe it's worth doing, so that we at least know for sure when it happens.

Yes, it's obviously not normal that it runs out of RAM for this little setup.

easy to report properly

It is not reported properly by the app here; probably something is crashing on the networking layer or OS level.

Not sure, but could something like this work, to at least report issues with `new`?

Looks good, but I suspect it's not a `new` in PIConGPU itself that is crashing here, but some other process that we starve of RAM. Otherwise we should already get a regular, visible error.

E.g. stderr reads:

[1,7]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/tbg/cpuNumaStarter.sh: line 90: 42826 Killed                  numactl --cpunodebind="$useNumaNode" --preferred="$useNumaNode" $*
[1,3]<stderr>:
[1,3]<stderr>:picongpu:42827 terminated with signal 11 at PC=7f9294c8cee9 SP=7f928528adb8.  Backtrace:
[1,3]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x97ee9)[0x7f9294c8cee9]
[1,3]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/input/bin/picongpu[0x5c090c]
[1,3]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/input/bin/picongpu[0x45e071]
[1,3]<stderr>:/opt/pkg/compiler/gnu/gcc/5.3.0/lib64/libstdc++.so.6(+0xbdc60)[0x7f92954bac60]
[1,3]<stderr>:/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f9297ded184]
[1,3]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f9294cef37d]
mpiexec: abort is already in progress...hit ctrl-c again to forcibly terminate

Anyway, the point here is to find out where we lose the memory in the first place.

From my tests it looks like it is the random number generator. I will check the cpu state size or if we pass wrong sizes to the RNG.

Ok, from my runtime test, Alpaka 0.3.3 reduces the memory footprint by a factor of 10 for the 8.cfg LWFA example :)

It will take a few days before Alpaka 0.3.3 is released and lands in PIConGPU:
https://github.com/ComputationalRadiationPhysics/alpaka/pull/588

Fixed with #2684 :-)

