After I fixed the compile error in the default laser wakefield example (#2652), I submitted an 8.cfg job using the laser.tpl.
The job did not exit cleanly but hung, with the following error output:
Data for JOB [42881,1] offset 0
======================== JOB MAP ========================
Data for node: laser024 Num slots: 64 Max slots: 0 Num procs: 8
Process OMPI jobid: [42881,1] App: 0 Process rank: 0
Process OMPI jobid: [42881,1] App: 0 Process rank: 1
Process OMPI jobid: [42881,1] App: 0 Process rank: 2
Process OMPI jobid: [42881,1] App: 0 Process rank: 3
Process OMPI jobid: [42881,1] App: 0 Process rank: 4
Process OMPI jobid: [42881,1] App: 0 Process rank: 5
Process OMPI jobid: [42881,1] App: 0 Process rank: 6
Process OMPI jobid: [42881,1] App: 0 Process rank: 7
=============================================================
[1,0]<stdout>:PIConGPU: 0.4.0-dev
[1,0]<stdout>: Build-Type: Release
[1,0]<stdout>:
[1,0]<stdout>:Third party:
[1,0]<stdout>: OS: Linux-4.4.0-38-generic
[1,0]<stdout>: arch: x86_64
[1,0]<stdout>: CXX: GNU (4.9.2)
[1,0]<stdout>: CMake: 3.10.1
[1,0]<stdout>: Boost: 1.62.0
[1,0]<stdout>: MPI:
[1,0]<stdout>: standard: 3.0
[1,0]<stdout>: flavor: OpenMPI (1.8.6)
[1,0]<stdout>: PNGwriter: 0.7.0
[1,0]<stdout>: libSplash: 1.7.0 (Format 4.0)
[1,0]<stdout>: ADIOS: NOTFOUND
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3AlpakaRand seed: 42
It also created a backtrace file picongpu.80s-22833,laser024.btr with the following content:
picongpu:22833 terminated with signal 11 at PC=7fe353ee2ee9 SP=7fe344099df8. Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(+0x97ee9)[0x7fe353ee2ee9]
/bigdata/hplsim/scratch/.../_LWFA_cpu/runs/001_default_LWFA/input/bin/picongpu[0x5915cc]
/bigdata/hplsim/scratch/.../_LWFA_cpu/runs/001_default_LWFA/input/bin/picongpu[0x467081]
/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0xc1ed0)[0x7fe354704ed0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fe356fe3184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe353f4537d]
@psychocoderHPC @sbastrakov @ax3l Do you have any idea what causes the error?
The backtrace does not show any useful information. I will try to reproduce it next week.
Can reproduce with current dev, even with gcc 5.3.0
8.cfg: crashes as described
1.cfg: works well
2.cfg: (derived from 1.cfg with 1x2x1, no -m) works well
4.cfg: (derived from 1.cfg with 1x4x1, no -m) works well
8.cfg: (derived from 1.cfg with 2x2x2, no -m) works well
2.cfg: (derived from 1.cfg with 1x2x1, with -m) works well
The problem with the original 8.cfg is that it runs out of RAM.
Somehow the setup ("8 CPU devices" = 8 processes with a total of 64 threads) exhausts the 256 GByte of RAM with its 192x1024x192 cells and 9437184 macro particles per device.
@ax3l I'm wondering how running out of RAM is possible. According to this page, the laser024 node, which runs all 8 processes in the original log, has 256 GB of memory, and the grid size in 8.cfg is only 192x1024x192. Am I missing something, or is the memory perhaps preallocated too aggressively? Also, simply running out of memory (on CPU with new, at least) should be easy to report properly; maybe it's worth doing so that we at least know for sure when it happens.
Yes, it's obviously not normal that it runs out of RAM for such a small setup.
Regarding "easy to report properly": it is not reported properly by the app here. It is probably crashing something in the networking layer or at the OS level.
Not sure, but could something like this work to at least report issues with new?
Looks good, but I suspect it is not a new in PIConGPU itself that fails here, but some other process that we block from getting more RAM. Otherwise we would already see a regular error.
E.g. stderr reads:
[1,7]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/tbg/cpuNumaStarter.sh: line 90: 42826 Killed numactl --cpunodebind="$useNumaNode" --preferred="$useNumaNode" $*
[1,3]<stderr>:
[1,3]<stderr>:picongpu:42827 terminated with signal 11 at PC=7f9294c8cee9 SP=7f928528adb8. Backtrace:
[1,3]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x97ee9)[0x7f9294c8cee9]
[1,3]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/input/bin/picongpu[0x5c090c]
[1,3]<stderr>:/bigdata/hplsim/scratch/huebl/lwfa-laser-008/input/bin/picongpu[0x45e071]
[1,3]<stderr>:/opt/pkg/compiler/gnu/gcc/5.3.0/lib64/libstdc++.so.6(+0xbdc60)[0x7f92954bac60]
[1,3]<stderr>:/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f9297ded184]
[1,3]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f9294cef37d]
mpiexec: abort is already in progress...hit ctrl-c again to forcibly terminate
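On the question of reporting failures of new raised above, here is a minimal C++ sketch, assuming a process-wide handler installed through the standard std::set_new_handler is acceptable. The names reportOutOfMemory and installOutOfMemoryReporter are hypothetical and not part of PIConGPU, and this is not necessarily what the earlier comment linked to.

```cpp
#include <cstdio>
#include <cstdlib>
#include <new>

// Hypothetical helper (not part of PIConGPU): operator new calls this when
// an allocation fails. It prints a clear message and aborts, so the failure
// is reported at the point where it happens.
inline void reportOutOfMemory()
{
    // fputs is used instead of iostream, which might itself need to allocate.
    std::fputs("fatal: operator new failed, host memory exhausted\n", stderr);
    std::abort();
}

// Installs the handler above and returns the previously installed handler
// (which may be nullptr), so that a caller can restore it later.
inline std::new_handler installOutOfMemoryReporter()
{
    return std::set_new_handler(reportOutOfMemory);
}
```

Calling installOutOfMemoryReporter() once at program start, before any large allocation, would be enough. A handler like this only runs when new itself fails. The "Killed" line in the stderr above may instead indicate that the operating system terminated a process, in which case the handler would never be called.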
Anyway, the point here is to find out where we lose the memory in the first place.
From my tests it looks like it is the random number generator. I will check the CPU state size or whether we pass wrong sizes to the RNG.
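To illustrate why the size of an RNG state can matter at this grid size, here is a rough C++ sketch of the arithmetic. It rests on assumptions that the thread does not confirm: that 192x1024x192 is the total grid and is split evenly over the 8 devices, that one generator state is stored per cell, and that std::mt19937 is a reasonable stand-in for a large CPU generator state. The function names are hypothetical.

```cpp
#include <cstdint>
#include <random>

// Cells handled by one device, ASSUMING the grid is split evenly.
constexpr std::uint64_t cellsPerDevice(
    std::uint64_t nx, std::uint64_t ny, std::uint64_t nz, std::uint64_t devices)
{
    return nx * ny * nz / devices;
}

// Bytes needed if every cell keeps its own generator state (ASSUMPTION).
constexpr std::uint64_t rngStateBytes(std::uint64_t cells, std::uint64_t stateBytes)
{
    return cells * stateBytes;
}

// Grid size and device count as quoted in the thread for 8.cfg.
constexpr std::uint64_t kCellsPerDevice = cellsPerDevice(192, 1024, 192, 8);
static_assert(kCellsPerDevice == 4718592, "192*1024*192/8 cells per device");

// The 9437184 macro particles per device quoted in the thread work out to
// exactly two per cell under the even-split assumption.
static_assert(9437184 / kCellsPerDevice == 2, "two macro particles per cell");

// Stand-in state size: sizeof(std::mt19937) is implementation-dependent,
// but it holds at least 624 words of 32 bits each.
constexpr std::uint64_t kStateBytesPerDevice =
    rngStateBytes(kCellsPerDevice, sizeof(std::mt19937));
```

For a hypothetical state of 5000 bytes, rngStateBytes(kCellsPerDevice, 5000) gives 23592960000 bytes, roughly 23.6 GB per device, so eight processes on one node would need a large share of 256 GB for generator states alone. Whether this matches the actual state layout is exactly what the comment above proposes to check.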
Ok, from my runtime test, Alpaka 0.3.3 will reduce the memory footprint by a factor of 10 for the 8.cfg LWFA example :)
We will need a few days before Alpaka 0.3.3 is released and lands in PIConGPU.
https://github.com/ComputationalRadiationPhysics/alpaka/pull/588
Fixed with #2684 :-)