PIConGPU: LWFA CPU sim hangs on HZDR laser queue

Created on 27 Jul 2018 · 10 Comments · Source: ComputationalRadiationPhysics/picongpu

After I fixed the compile error in the default laser wakefield example (#2652), I submitted an 8.cfg job using the laser.tpl.

The job did not crash outright but hangs, producing the following output:

 Data for JOB [42881,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: laser024        Num slots: 64   Max slots: 0    Num procs: 8
        Process OMPI jobid: [42881,1] App: 0 Process rank: 0
        Process OMPI jobid: [42881,1] App: 0 Process rank: 1
        Process OMPI jobid: [42881,1] App: 0 Process rank: 2
        Process OMPI jobid: [42881,1] App: 0 Process rank: 3
        Process OMPI jobid: [42881,1] App: 0 Process rank: 4
        Process OMPI jobid: [42881,1] App: 0 Process rank: 5
        Process OMPI jobid: [42881,1] App: 0 Process rank: 6
        Process OMPI jobid: [42881,1] App: 0 Process rank: 7

 =============================================================
[1,0]<stdout>:PIConGPU: 0.4.0-dev
[1,0]<stdout>:  Build-Type: Release
[1,0]<stdout>:
[1,0]<stdout>:Third party:
[1,0]<stdout>:  OS:         Linux-4.4.0-38-generic
[1,0]<stdout>:  arch:       x86_64
[1,0]<stdout>:  CXX:        GNU (4.9.2)
[1,0]<stdout>:  CMake:      3.10.1
[1,0]<stdout>:  Boost:      1.62.0
[1,0]<stdout>:  MPI:        
[1,0]<stdout>:    standard: 3.0
[1,0]<stdout>:    flavor:   OpenMPI (1.8.6)
[1,0]<stdout>:  PNGwriter:  0.7.0
[1,0]<stdout>:  libSplash:  1.7.0 (Format 4.0)
[1,0]<stdout>:  ADIOS:      NOTFOUND
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3AlpakaRand seed: 42

It also created a backtrace file picongpu.80s-22833,laser024.btr with the following content:

picongpu:22833 terminated with signal 11 at PC=7fe353ee2ee9 SP=7fe344099df8.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(+0x97ee9)[0x7fe353ee2ee9]
/bigdata/hplsim/scratch/.../_LWFA_cpu/runs/001_default_LWFA/input/bin/picongpu[0x5915cc]
/bigdata/hplsim/scratch/.../_LWFA_cpu/runs/001_default_LWFA/input/bin/picongpu[0x467081]
/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0xc1ed0)[0x7fe354704ed0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fe356fe3184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe353f4537d]

@psychocoderHPC @sbastrakov @ax3l Do you have any idea what causes the error?

Labels: omp2b, bug, examples, third party

Most helpful comment

From my tests it looks like it is the random number generator. I will check the cpu state size or if we pass wrong sizes to the RNG.

All 10 comments

The backtrace is not showing any useful information. I will try to reproduce it next week.


I can reproduce this with current dev, even with gcc 5.3.0:

  • 8.cfg: crashes as described
  • 1.cfg: works well
  • 2.cfg: (derived from 1.cfg with 1x2x1, no -m) works well
  • 4.cfg: (derived from 1.cfg with 1x4x1, no -m) works well
  • 8.cfg: (derived from 1.cfg with 2x2x2, no -m) works well
  • 2.cfg: (derived from 1.cfg with 1x2x1, with -m) works well

The problem with the original 8.cfg is that it runs out of RAM.
Somehow the setup ("8 CPU devices" = 8 processes with a total of 64 threads) exhausts the 256 GByte of RAM with its 192x1024x192 cells and 9,437,184 macro-particles per device.

@ax3l I'm wondering how running out of RAM is possible here. According to this page, the laser024 node, which hosts all 8 processes in the original log, has 256 GB of memory, and the grid size in 8.cfg is only 192x1024x192. Am I missing something, or is the memory perhaps preallocated too aggressively? Also, simply running out of memory (on CPU, at least via `new`) should be easy to report properly; maybe it's worth doing, so that we at least know for sure when it happens.

Yes, it's obviously not normal that it runs out of RAM for this little setup.

easy to report properly

It is not reported properly by the app here; probably something is crashing on the networking layer or OS level.

Not sure, but could something like this work, to at least report issues with `new`?

Looks good, but I suspect it's not a `new` in PIConGPU itself that is crashing here, but some other process that we starve of RAM. Otherwise we should already get a regular, visible error.

E.g. stderr reads:

[1,7]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/tbg/cpuNumaStarter.sh: line 90: 42826 Killed                  numactl --cpunodebind="$useNumaNode" --preferred="$useNumaNode" $*
[1,3]<stderr>:
[1,3]<stderr>:picongpu:42827 terminated with signal 11 at PC=7f9294c8cee9 SP=7f928528adb8.  Backtrace:
[1,3]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x97ee9)[0x7f9294c8cee9]
[1,3]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/input/bin/picongpu[0x5c090c]
[1,3]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/input/bin/picongpu[0x45e071]
[1,3]<stderr>:/opt/pkg/compiler/gnu/gcc/5.3.0/lib64/libstdc++.so.6(+0xbdc60)[0x7f92954bac60]
[1,3]<stderr>:/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f9297ded184]
[1,3]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f9294cef37d]
mpiexec: abort is already in progress...hit ctrl-c again to forcibly terminate

Anyway, the point here is to find out where we lose the memory in the first place.

From my tests it looks like it is the random number generator. I will check the cpu state size or if we pass wrong sizes to the RNG.

Ok, from my runtime test, Alpaka 0.3.3 reduces the memory footprint by a factor of 10 for the 8.cfg LWFA example :)

It will take a few days before Alpaka 0.3.3 is released and lands in PIConGPU:
https://github.com/ComputationalRadiationPhysics/alpaka/pull/588

Fixed with #2684 :-)

