I pulled the current dev branch 0f800d0 and created an example based on the Kelvin-Helmholtz example. If I have HDF5 output enabled (the default), the simulation instantly crashes and gives me the following error messages:
stdout
Running program...
PIConGPU: 0.5.0-dev
Build-Type: Release
Third party:
OS: Linux-3.10.0-693.11.6.el7.x86_64
arch: x86_64
CXX: GNU (7.3.0)
CMake: 3.11.3
CUDA: 9.2.88
mallocMC: 2.3.1
Boost: 1.68.0
MPI:
standard: 3.1
flavor: OpenMPI (2.1.6)
PNGwriter: 0.7.0
libSplash: 1.7.0 (Format 4.0)
ADIOS: 1.13.1
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00556 ? 1
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0319333
PIConGPUVerbose PHYSICS(1) | species i: omega_p * dt <= 0.1 ? 0.000745229
PIConGPUVerbose PHYSICS(1) | macro particles per device: 3686400
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 326.577
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.79e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 5.36628e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 2.97492e-28
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 5.23234e-17
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 9.5224e+12
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 31763.3
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 2.67372e-11
initialization time: 10sec 721msec = 10 sec
full simulation time: 10sec 741msec = 10 sec
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node gp005 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpiexec during cleanup)
------------------------------
- - - - - job epilog - - - - -
------------------------------
Job ID: 1879910
was running on nodes: gp[005,008-010]
by user: carste06
in partition: gpu
using account: default
number of CPUs used: 96
number of nodes requested: 4
------------------------------
walltime reqd: 23:53:00
walltime used: 00:01:11
------------------------------
Mon Aug 26 15:24:45 CEST 2019
------------------------------
And here is the stderr, which is too long to paste directly into the issue:
stderr
I've run this simulation on the gpu partition of hemera with etc/include/16.cfg, but I think I've also seen it on k20. If I run the simulation without HDF5 output, it runs fine. The error only seems to appear when HDF5 output is enabled.
Hmm, the key line in stderr seems to be:
Unhandled exception of type 'St12out_of_range' with message 'vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)', terminating
The following little program reproduces the error, although I don't know exactly where yours comes from yet.
#include <iostream>
#include <vector>

int bad_function() {
    std::vector<int> v(2);  // valid indices: 0 and 1
    return v.at(2);         // checked access: throws std::out_of_range
}

int main() {
    bad_function();
    return 0;
}
Output:
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
If I start the Laser Wakefield example I get the same error messages...
Although this seems to be a logic error on our side (maybe check the recent updates/PRs regarding the plugin), I would also recommend not using parallel HDF5 with the ancient OpenMPI version in the current module.
Maybe check if newer modules exist in parallel or request them? Here are recommended releases to mitigate two severe OpenMPI bugs that we found with typical simulations on Hemera:
https://github.com/openPMD/openPMD-api/issues/446
Instead of OpenMPI (2.1.6), use the versions recommended there.
Alternatively, as a quick test, try whether exporting OMPI_MCA_io=^ompio mitigates the issue, if it isn't already set in your .tpl file.
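For reference, the MCA parameter can be set either via the environment or on the mpirun command line; this is a general OpenMPI mechanism, sketched here (the picongpu invocation is illustrative):

```shell
# Exclude OpenMPI's OMPIO MPI-IO component so MPI-IO falls back to
# ROMIO, working around known parallel-HDF5 bugs in old OpenMPI.

# 1) As an environment variable, e.g. in the batch .tpl file:
export OMPI_MCA_io=^ompio

# 2) Equivalently, directly on the mpirun/mpiexec command line:
#    mpirun --mca io ^ompio ./bin/picongpu ...

echo "io selection: $OMPI_MCA_io"
```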
That said: this is probably a logic bug in recent changes; the best approach is to compile with debug symbols and find the exact line in the code where this is thrown.
Alternatively, as a quick test, try whether exporting OMPI_MCA_io=^ompio mitigates the issue, if it isn't already set in your .tpl file.
I'm using the .tpl files from the same branch and this line is included in them, so it does not mitigate the issue. But I'll also look into the other issues.
Yes, I think it's likely another issue. Nevertheless, I just recalled that export OMPI_MCA_io=^ompio will actually only work around one of the two issues that were fixed in the cited releases.
Feel free to post the line that throws this as soon as you found out.
@ax3l you are right, it's a recently introduced bug. @finnolec and I debugged it today, #3038 should fix it.
Fixed by #3038.