I pulled the current dev branch 0f800d0 and created an example based on the Kelvin-Helmholtz example. If I have HDF5 output enabled (the default), the simulation instantly crashes and gives me the following error messages:
stdout
Running program...
PIConGPU: 0.5.0-dev
Build-Type: Release
Third party:
OS: Linux-3.10.0-693.11.6.el7.x86_64
arch: x86_64
CXX: GNU (7.3.0)
CMake: 3.11.3
CUDA: 9.2.88
mallocMC: 2.3.1
Boost: 1.68.0
MPI:
standard: 3.1
flavor: OpenMPI (2.1.6)
PNGwriter: 0.7.0
libSplash: 1.7.0 (Format 4.0)
ADIOS: 1.13.1
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00556 ? 1
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0319333
PIConGPUVerbose PHYSICS(1) | species i: omega_p * dt <= 0.1 ? 0.000745229
PIConGPUVerbose PHYSICS(1) | macro particles per device: 3686400
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 326.577
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.79e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 5.36628e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 2.97492e-28
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 5.23234e-17
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 9.5224e+12
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 31763.3
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 2.67372e-11
initialization time: 10sec 721msec = 10 sec
full simulation time: 10sec 741msec = 10 sec
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node gp005 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpiexec during cleanup)
------------------------------
- - - - - job epilog - - - - -
------------------------------
Job ID: 1879910
was running on nodes: gp[005,008-010]
by user: carste06
in partition: gpu
using account: default
number of CPUs used: 96
number of nodes requested: 4
------------------------------
walltime reqd: 23:53:00
walltime used: 00:01:11
------------------------------
Mon Aug 26 15:24:45 CEST 2019
------------------------------
And here is the stderr, which is too long to paste directly into the issue:
stderr
I've run this simulation on the gpu partition of hemera with etc/include/16.cfg, but I think I've also seen it on k20. If I run the simulation without HDF5 output, it runs fine. The error only seems to appear when HDF5 output is enabled.
Hmm, the key line in stderr seems to be:
Unhandled exception of type 'St12out_of_range' with message 'vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)', terminating
The following little program reproduces the error, although I don't know exactly where yours comes from yet.
#include <iostream>
#include <vector>

int bad_function() {
    std::vector<int> v(2);  // valid indices: 0 and 1
    return v.at(2);         // checked access: throws std::out_of_range
}

int main() {
    bad_function();
    return 0;
}
Output:
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
If I start the Laser Wakefield example I get the same error messages...
Although this seems to be a logic error on our side (maybe check the recent updates/PRs regarding the plugin), I would also recommend not using parallel HDF5 with the ancient OpenMPI version in the current module.
Maybe check if newer modules exist in parallel or request them? Here are recommended releases to mitigate two severe OpenMPI bugs that we found with typical simulations on Hemera:
https://github.com/openPMD/openPMD-api/issues/446
Instead of OpenMPI (2.1.6), use the versions recommended there.
Alternatively, as a quick test, try whether exporting OMPI_MCA_io=^ompio mitigates the issue, if it isn't already set in your .tpl file.
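For reference, the MCA parameter can be set either via the environment or on the mpirun command line; this is a general OpenMPI mechanism, sketched here (the picongpu invocation is illustrative):

```shell
# Exclude OpenMPI's OMPIO MPI-IO component so MPI-IO falls back to
# ROMIO, working around known parallel-HDF5 bugs in old OpenMPI.

# 1) As an environment variable, e.g. in the batch .tpl file:
export OMPI_MCA_io=^ompio

# 2) Equivalently, directly on the mpirun/mpiexec command line:
#    mpirun --mca io ^ompio ./bin/picongpu ...

echo "io selection: $OMPI_MCA_io"
```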
That said: this is probably a logic bug in recent changes; the best approach is to compile with debug symbols and find the exact line in the code where this is thrown.
Alternatively, as a quick test, try whether exporting OMPI_MCA_io=^ompio mitigates the issue, if it isn't already set in your .tpl file.
I'm using the .tpl files from the same branch and this line is included in them, so it does not mitigate the issue. But I'll also look into the other issues.
Yes, I think it's likely another issue. Nevertheless, I just recalled that export OMPI_MCA_io=^ompio will actually only work around one of the two issues that were fixed in the cited releases.
Feel free to post the line that throws this as soon as you found out.
@ax3l you are right, it's a recently introduced bug. @finnolec and I debugged it today, #3038 should fix it.
Fixed by #3038.