I recently switched from HDF5 to ADIOS output (for data dumps and checkpoints) on the Taurus system. The code compiles flawlessly and also starts to run. However, 45 minutes into the simulation, PIConGPU crashes with the following error:
# multiple versions of the error below with different byte sizes
ERROR: MPI_AMR method (with brigade strategy): Cannot allocate 2 x 2135610252 bytes for aggregation buffers. An aggregator process needs a buffer to hold one process' output for writing, while it needs another buffer to concurrently receive another process' outpterminate called after throwing an instance of 'std::runtime_error'
# multiple versions of the error below with different byte sizes
what(): ADIOS: error at cmd 'adios_close(threadParams->adiosFileHandle)' (-1, -1) in /scratch/p_electron/richard/picongpu2/include/pmacc/../picongpu/plugins/adios/ADIOSWriter.hpp:1556 MPI_AMR method (with brigade strategy): Cannot allocate 2 x 2135610252 bytes for aggregation buffers. An aggregator process needs a buffer to hold one process' output for writing, while it needs another buffer to concurrently receive another process' outp
# traceback origin of the error:
/picongpu(_ZN8picongpu5adios11ADIOSWriter10writeAdiosEPvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc305)[0x1f1bb25]
Due to the error, I switched back to HDF5 and that runs flawlessly.
I built ADIOS myself following the instructions in the dev version of the readthedocs manual. However, I am running the 0.4.2 version. (Might this already cause the error?) I installed c-blosc and ADIOS as described in the manual.
The version output of picongpu -v is:
PIConGPU: 0.4.2
Build-Type: Release
Third party:
OS: Linux-3.10.0-693.21.1.el7.x86_64
arch: x86_64
CXX: GNU (6.4.0)
CMake: 3.10.2
CUDA: 9.2.88
mallocMC: 2.3.0
Boost: 1.66.0
MPI:
standard: 3.1
flavor: OpenMPI (2.1.2)
PNGwriter: NOTFOUND
libSplash: 1.7.0 (Format 4.0)
ADIOS: 1.13.1
ldd shows no missing libraries.
My ADIOS OST setup for the run is:
# checkpoints:
--checkpoint.period 30000
--checkpoint.backend adios
--checkpoint.adios.aggregators 128
--checkpoint.adios.ost 32
--checkpoint.adios.transport-params
--checkpoint.adios.compression blosc:threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd
--checkpoint.adios.disable-meta 1
# data output:
--adios.period 5000
--adios.file simData
--adios.aggregators 128
--adios.ost 32
--adios.transport-params
--adios.compression blosc:threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd
--adios.disable-meta 1
--adios.source 'species_all,fields_all'
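As a side note on the flag format: the compression option above packs a codec name and key=value pairs into a single string. The following sketch illustrates how such a string decomposes; it is not PIConGPU's actual option parser, just an illustration of the format.

```python
# Illustrative parser for the plugin's compression option string, e.g.
#   blosc:threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd
# This is NOT PIConGPU's actual parser, only a sketch of the format.

def parse_compression(option):
    codec, _, params = option.partition(":")
    settings = {}
    for pair in filter(None, params.split(",")):
        key, _, value = pair.partition("=")
        settings[key] = value
    return codec, settings

codec, settings = parse_compression(
    "blosc:threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd"
)
print(codec)                   # blosc
print(settings["compressor"])  # zstd
print(settings["threads"])     # 10
```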
Might the error be caused by the unspecified adios.transport-params? For the runs by @BeyondEspresso on the V100, this setup worked.
offline discussion with @psychocoderHPC:
One possible issue might be that the number of aggregators is smaller than the number of GPUs (128 < 144), so some nodes require more memory when they collect data from another node.
Either setting --adios.aggregators 144 or not setting this flag at all might avoid the crash.
On the V100 the error did not occur since the Power9 nodes have more host memory (~62 GB << 441 GB).
I will test setting no aggregators.
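The size of the failing allocation matches double buffering of one rank's output. A back-of-the-envelope check (the per-rank output size is taken from the error message above; ADIOS' exact internal accounting may differ):

```python
# Back-of-the-envelope check of the MPI_AMR aggregation-buffer demand.
# The per-rank output size comes from the crash message above; the exact
# accounting inside ADIOS may differ.

per_rank_output = 2_135_610_252            # bytes, from the error message
buffers = 2 * per_rank_output              # write buffer + receive buffer

node_ram = 62_000 * 1024**2                # ~62 GB host RAM per K80 node
ranks_per_node = 4                         # 4 devices per node
ram_per_rank = node_ram // ranks_per_node  # ~15.1 GiB nominal share

print(round(buffers / 1024**3, 2))         # ~3.98 GiB just for the buffers
# These buffers compete with the rank's regular host-side particle buffers,
# so ~4 GiB extra can already exceed what is left of the ~15 GiB share.
```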
@BeyondEspresso How could you allocate 128 aggregators if you only used 28 GPUs?
Yes, in order to aggregate the output from N GPUs (devices) over M aggregator processes during output, each aggregator needs host-side buffers for the data it collects.
If a cluster is designed with too little host-RAM (i.e. not a good multiple of the device RAM), aggregation might not be possible. E.g. on Titan the ratio is 6:32 GByte (about 1:5), which works well for us.
An alternative way on such systems is to perform (off-node) staging, but we are not fully there yet (MA thesis for ADIOS2 staging starting soon with @franzpoeschel, when we switch output to openPMD-api).
On which queue of Taurus do you run, how much host-RAM and how many GPUs of which kind are used per node and how many GPUs did you use in the run you describe (-d)?
That said, for the low number of nodes on Taurus you can probably skip aggregation: it should not bring much benefit besides putting slightly fewer files on the file system, which are not many here anyway (see the manual above for detailed reasoning; aggregation is usually needed for a few thousand devices and more).
I use 144 GPUs on the k80. These nodes have 62000 MB RAM and 4 GPUs per node.
Thus each MPI rank has 15.1 GB of host RAM. However, with particles this does not seem to be enough for the double buffering ADIOS needs for the 24 GB / 2 = 12 GB per GPU.
Rerunning with no aggregators set led to the same error.
What still confuses me: it only tries to allocate ~4 GB and fails. I will try a different memory setting. Perhaps we just die due to per-rank memory limits (~2.5 GB).
Just to summarize in device-speech: "256 NVIDIA Tesla K80 GPUs in 64 nodes" in ZIH's docs actually means 2x K80 boards per node, i.e. 4 "devices" with 12 GByte device-RAM each (total: 48 GByte device memory per node).
62 GByte host-RAM per 48 GByte device-RAM is quite ill-designed (1:1.29): that's not even 2x the device-RAM in (cheap) host-RAM. I remember that we noticed that in the past... and should provide that as feedback @BeyondEspresso A good ratio is 1:2.5 to 1:3 (or more).
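The host-to-device RAM ratios quoted above can be checked quickly (a sketch; the hardware numbers are the ones cited in this thread):

```python
# Host-RAM to device-RAM ratios for the systems discussed in this thread.

def host_per_device(host_gb, device_gb):
    return host_gb / device_gb

# Taurus K80 nodes: 4 x 12 GB devices = 48 GB device RAM, ~62 GB host RAM
print(round(host_per_device(62, 48), 2))   # 1.29 -> too tight for aggregation
# Titan: 6 GB device RAM, 32 GB host RAM
print(round(host_per_device(32, 6), 2))    # 5.33 -> comfortable
# Rule of thumb from the discussion: aim for a factor of 2.5-3 (or more)
```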
I want to add to this issue. I am trying to run a simulation on the Taurus ml queue, whose nodes are each equipped with 6 x V100 and use a Power9 architecture. I compiled the libraries with this script (https://gist.github.com/steindev/cc02eae81f465833afa27fc8880f3473).
My simulation runs fine with HDF5 output, but is terribly slow since the scratch filesystem is mounted via NFS only. Due to compression, I thought I could speed things up at least a little by using ADIOS.
Therefore, I changed my .cfg accordingly, i.e. removed the HDF5 output and added the ADIOS configuration as follows
# ADIOS params
TBG_adios_agg="0"
TBG_adios_ost="32"
TBG_adios_transport_params="'stripe_count=4;stripe_size=1048576;block_size=1048576'"
TBG_adios_compression="'blosc:threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd'"
TBG_adios_additional_params="--adios.aggregators !TBG_adios_agg \
--adios.ost !TBG_adios_ost \
--adios.transport-params !TBG_adios_transport_params \
--adios.compression !TBG_adios_compression \
--adios.disable-meta 1"
But the simulation fails when writing the initial output at timestep 0. The error message is
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[taurusml6:116624] *** Process received signal ***
[taurusml6:116624] Signal: Aborted (6)
[taurusml6:116624] Signal code: (-6)
[taurusml6:116624] [ 0] [0x2000000504d8]
[taurusml6:116624] [ 1] /usr/lib64/libc.so.6(abort+0x2b4)[0x200000da1f94]
[taurusml6:116624] [ 2] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x200000b3f7c4]
[taurusml6:116624] [ 3] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(+0xba524)[0x200000b3a524]
[taurusml6:116624] [ 4] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x200000b3a5e0]
[taurusml6:116624] [ 5] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x200000b3aa90]
[taurusml6:116624] [ 6] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_Znwm+0xa4)[0x200000b3b504]
[taurusml6:116624] [ 7] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_Znam+0x18)[0x200000b3b5f8]
[taurusml6:116624] [ 8] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter16CallWriteSpeciesINS_7plugins4misc13SpeciesFilterINS_9ParticlesIN5pmacc11compileTime6StringIJLc98ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEEN5boost3mpl6vectorINS_24placeholder_definition3914particlePusherINS_9particles6pusher12AccelerationENS7_24placeholder_definition2213pmacc_isAliasEEENS_24placeholder_definition385shapeINSG_6shapes3P4SESK_EENS_24placeholder_definition4513interpolationINS_28FieldToParticleInterpolationISP_NS_30AssignedTrilinearInterpolationEEESK_EENS_24placeholder_definition467currentINS_13currentSolver3EmZISP_EESK_EENS_24placeholder_definition5212densityRatioINS_25placeholder_definition14117DensityRatioBunchESK_EENS_24placeholder_definition509massRatioINS_25placeholder_definition13918MassRatioElectronsESK_EENS_24placeholder_definition5111chargeRatioINS_25placeholder_definition14020ChargeRatioElectronsESK_EEN4mpl_2naES1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_EENSC_6v_itemINS_24placeholder_definition309weightingENS1L_INS_24placeholder_definition288momentumENS1L_INS_24placeholder_definition258positionINS_24placeholder_definition2712position_picESK_EENSC_7vector0IS1J_EELi0EEELi0EEELi0EEEEENSG_6filter3AllEEEEclINS7_9DataSpaceILj2EEEEEvRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS2E_EEPNS0_12ThreadParamsET_+0x718)[0x11489a18]
[taurusml6:116624] [ 9] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter10writeAdiosEPvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x8d84)[0x11672324]
[taurusml6:116624] [10] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter8dumpDataEj+0xca8)[0x116c1a98]
[taurusml6:116624] [11] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter6notifyEj+0x130)[0x116c2720]
[taurusml6:116624] [12] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN5pmacc16SimulationHelperILj2EE11dumpOneStepEj+0x174)[0x1136f0e4]
[taurusml6:116624] [13] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN5pmacc16SimulationHelperILj2EE15startSimulationEv+0x310)[0x115c8d30]
[taurusml6:116624] [14] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0x12c)[0x115c971c]
[taurusml6:116624] [15] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(main+0x118)[0x111d6ad8]
[taurusml6:116624] [16] /usr/lib64/libc.so.6(+0x25100)[0x200000d85100]
[taurusml6:116624] [17] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x200000d852f4]
[taurusml6:116624] *** End of error message ***
srun: error: taurusml6: task 11: Killed
Since memory was already discussed, here is some information:
GPU-Memory: 32256MiB (by nvidia-smi)
and from sinfo -n taurusml4 -O cpus,freemem,memory,gres
CPUS FREE_MEM MEMORY GRES
176 350655 254000 gpu:6
Honestly, I do not know where to start. Is this a configuration error, something wrong with the ADIOS library, a cluster problem, or a PIConGPU problem?
In order to exclude random errors, I resubmitted the same simulation twice. Interestingly, both simulations crashed without giving any reasonable error message. In both cases the stderr file reads (except for different node names)
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) modenv/ml
Module libpng/1.6.34-GCCcore-7.3.0, git/2.18.0-GCCcore-6.4.0, CMake/3.11.4-GCCcore-7.3.0, fosscuda/2018b and 20 dependencies unloaded.
Module fosscuda/2018b and 13 dependencies loaded.
Module CMake/3.11.4-GCCcore-7.3.0 and 1 dependency loaded.
The following have been reloaded with a version change:
1) GCCcore/7.3.0 => GCCcore/6.4.0
2) ncurses/6.1-GCCcore-7.3.0 => ncurses/6.0-GCCcore-6.4.0
3) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-6.4.0
Module git/2.18.0-GCCcore-6.4.0 and 9 dependencies loaded.
The following have been reloaded with a version change:
1) GCCcore/6.4.0 => GCCcore/7.3.0
2) zlib/1.2.11-GCCcore-6.4.0 => zlib/1.2.11-GCCcore-7.3.0
Module libpng/1.6.34-GCCcore-7.3.0 and 2 dependencies loaded.
srun: error: taurusml4: task 8: Killed
srun: Terminating job step 8271608.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8271608.1 ON taurusml3 CANCELLED AT 2019-04-12T11:49:22 ***
srun: error: taurusml3: task 0: Killed
srun: error: taurusml4: tasks 9-10: Killed
srun: error: taurusml7: tasks 18,20-23: Killed
srun: error: taurusml7: task 19: Killed
srun: error: taurusml4: task 7: Killed
srun: error: taurusml4: task 6: Killed
srun: error: taurusml4: task 11: Killed
srun: error: Timed out waiting for job step to complete
srun: error: taurusml6: task 12: Killed
while the stdout looks like
PIConGPU: 0.5.0-dev
Build-Type: Release
Third party:
OS: Linux-4.14.0-49.13.1.el7a.ppc64le
arch: ppc64le
CXX: GNU (7.3.0)
CMake: 3.11.4
CUDA: 9.2.148
mallocMC: 2.3.0
Boost: 1.68.0
MPI:
standard: 3.1
flavor: OpenMPI (3.1.1)
PNGwriter: 0.7.0
libSplash: 1.7.0 (Format 4.0)
ADIOS: 1.13.1
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider2XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00056 ? 1
PIConGPUVerbose PHYSICS(1) | species b: omega_p * dt <= 0.1 ? 0.0999333
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0999333
PIConGPUVerbose PHYSICS(1) | species H: omega_p * dt <= 0.1 ? 0.00233215
PIConGPUVerbose PHYSICS(1) | species C: omega_p * dt <= 0.1 ? 0.00403957
PIConGPUVerbose PHYSICS(1) | species N: omega_p * dt <= 0.1 ? 0.00436214
PIConGPUVerbose PHYSICS(1) | species O: omega_p * dt <= 0.1 ? 0.00468138
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 704.846
PIConGPUVerbose PHYSICS(1) | macro particles per device: 1562880000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 25.6359
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 2.67558e-18
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 8.0212e-10
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 2.33527e-29
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 4.10732e-18
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 6.3706e+14
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 2.12501e+06
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 2.09884e-12
initialization time: 35sec 978msec = 35 sec
Wait a second, now I see that the modules are loaded and unloaded in different versions which do not seem to fit together, since they were compiled with different GCC versions.
I will investigate...
The git module only exists for GCCcore/6.4.0, and therefore modules were switched when git was loaded. However, git is available on the nodes even without loading the module. Thus, I no longer load the git module in my V100_picongpu.profile, which prevents the module switching, especially for the compiler and zlib modules. All libraries I take from the module system are now compatible with GCCcore/7.3.0.
Furthermore, I recompiled all libraries which are not provided by the module system, see install script at (https://gist.github.com/steindev/cc02eae81f465833afa27fc8880f3473), recompiled PIConGPU, and ran the simulation again.
Unfortunately, it still crashes when writing time step zero, although I now get a new runtime error message.
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) modenv/ml
Module libpng/1.6.34-GCCcore-7.3.0, CMake/3.11.4-GCCcore-7.3.0, fosscuda/2018b and 14 dependencies unloaded.
Module fosscuda/2018b and 13 dependencies loaded.
Module CMake/3.11.4-GCCcore-7.3.0 and 1 dependency loaded.
Module libpng/1.6.34-GCCcore-7.3.0 loaded.
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x128651f0 valid_mask = 0x1)
[taurusml2][[13926,1],4][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: taurusml2
Local device: mlx5_1
--------------------------------------------------------------------------
srun: error: taurusml3: task 8: Killed
srun: Terminating job step 8336998.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8336998.1 ON taurusml2 CANCELLED AT 2019-04-14T20:13:10 ***
srun: error: taurusml6: task 21: Killed
#
# ...srun kills all the remaining tasks...
#
srun: error: Timed out waiting for job step to complete
I have tried this several times on different nodes. The error persists.
Googling the error message ibv_exp_query_device: invalid comp_mask !!! leads to an issue in Open MPI (open-mpi/ompi#5914) which seems to be fixed in a newer version.
However, I am wondering why this error only occurs when using ADIOS output. I started the same simulation with HDF5 output and the new libraries yesterday as well, and it runs fine.
Should I contact the Taurus system admins to install a newer MPI library, or do we need to investigate this error within PIConGPU itself?
Just a stupid guess: can you check that your ADIOS build used the correct MPI? I once had an ADIOS + MPI problem on a Juelich machine and had to set the values of CC, CXX and add --with-mpi=[my $MPI_ROOT] when building ADIOS. I understand this is unlikely, but it should be rather easy to check.
Thanks for the hint. I added a definition of the MPI_ROOT variable to the V100_picongpu.profile and adjusted the configure flag --with-mpi=$MPI_ROOT of the ADIOS build. Accordingly, I recompiled the library and PIConGPU afterwards and let the simulation run.
The situation got worse. Instead of the invalid comp_mask error, I now do not receive any error message at all. :unamused:
The only thing I am left with is EXIT CODE 137, which is written in the e-mail from the batch system.
The following is for your reference:
stdout:
Running program...
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ Note: You need to compile picongpu on a node. @
@ Likewise for building the libraries. @
@ Get a node with the getNode command. @
@ Then source V100_picongpu.profile again.@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
PIConGPU: 0.5.0-dev
Build-Type: Release
Third party:
OS: Linux-4.14.0-49.13.1.el7a.ppc64le
arch: ppc64le
CXX: GNU (7.3.0)
CMake: 3.11.4
CUDA: 9.2.148
mallocMC: 2.3.0
Boost: 1.68.0
MPI:
standard: 3.1
flavor: OpenMPI (3.1.1)
PNGwriter: 0.7.0
libSplash: 1.7.0 (Format 4.0)
ADIOS: 1.13.1
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider2XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00056 ? 1
PIConGPUVerbose PHYSICS(1) | species b: omega_p * dt <= 0.1 ? 0.0999333
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0999333
PIConGPUVerbose PHYSICS(1) | species H: omega_p * dt <= 0.1 ? 0.00233215
PIConGPUVerbose PHYSICS(1) | species C: omega_p * dt <= 0.1 ? 0.00403957
PIConGPUVerbose PHYSICS(1) | species N: omega_p * dt <= 0.1 ? 0.00436214
PIConGPUVerbose PHYSICS(1) | species O: omega_p * dt <= 0.1 ? 0.00468138
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 704.846
PIConGPUVerbose PHYSICS(1) | macro particles per device: 1562880000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 25.6359
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 2.67558e-18
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 8.0212e-10
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 2.33527e-29
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 4.10732e-18
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 6.3706e+14
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 2.12501e+06
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 2.09884e-12
initialization time: 36sec 491msec = 36 sec
stderr:
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) modenv/ml
Module libpng/1.6.34-GCCcore-7.3.0, CMake/3.11.4-GCCcore-7.3.0, fosscuda/2018b and 14 dependencies unloaded.
Module fosscuda/2018b and 13 dependencies loaded.
Module CMake/3.11.4-GCCcore-7.3.0 and 1 dependency loaded.
Module libpng/1.6.34-GCCcore-7.3.0 loaded.
srun: error: taurusml6: task 10: Killed
srun: Terminating job step 8419841.1
slurmstepd: error: *** STEP 8419841.1 ON taurusml4 CANCELLED AT 2019-04-17T14:33:09 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: taurusml6: task 9: Killed
srun: error: taurusml8: tasks 18,20: Killed
srun: error: taurusml6: task 8: Killed
srun: error: taurusml8: task 22: Killed
srun: error: taurusml8: tasks 19,23: Killed
srun: error: taurusml8: task 21: Killed
srun: error: taurusml4: task 0: Killed
srun: error: Timed out waiting for job step to complete
Could still be a memory issue. Both the first error
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
and getting killed by the watchdog daemon could imply you are running out of host-side RAM.
Can you try just using half the number of GPUs per node to verify?
Also, avoid going wild with threads=10 right away. If I remember correctly, the slurm watchdog at ZIH also checks that you are not overallocating threads compared to the number requested in the job script.
@ax3l: Great intuition that this could be a memory issue! Yesterday, as a first attempt, I tried doubling the number of GPUs for the simulation, which allowed it to run for 7.25 hours, though it then crashed with the known errors:
Module GCCcore/7.3.0, binutils/2.30-GCCcore-7.3.0, GCC/7.3.0-2.30, CUDA/9.2.88-GCC-7.3.0-2.30, gcccuda/2018b, zlib/1.2.11-GCCcore-7.3.0, numactl/2.0.11-GCCcore-7.3.0, hwloc/1.11.10-GCCcore-7.3.0, OpenMPI/3.1.1-gcccuda-2018b, OpenBLAS/0.3.1-GCC-7.3.0-2.30, gompic/2018b, FFTW/3.3.8-gompic-2018b, ScaLAPACK/2.0.2-gompic-2018b-OpenBLAS-0.3.1, ncurses/6.1-GCCcore-7.3.0, fosscuda/2018b, CMake/3.11.4-GCCcore-7.3.0, libpng/1.6.34-GCCcore-7.3.0 unloaded.
Module GCCcore/7.3.0, binutils/2.30-GCCcore-7.3.0, GCC/7.3.0-2.30, CUDA/9.2.88-GCC-7.3.0-2.30, gcccuda/2018b, zlib/1.2.11-GCCcore-7.3.0, numactl/2.0.11-GCCcore-7.3.0, hwloc/1.11.10-GCCcore-7.3.0, OpenMPI/3.1.1-gcccuda-2018b, OpenBLAS/0.3.1-GCC-7.3.0-2.30, gompic/2018b, FFTW/3.3.8-gompic-2018b, ScaLAPACK/2.0.2-gompic-2018b-OpenBLAS-0.3.1, ncurses/6.1-GCCcore-7.3.0, fosscuda/2018b, CMake/3.11.4-GCCcore-7.3.0, libpng/1.6.34-GCCcore-7.3.0 loaded.
Module fosscuda/2018b unloaded.
Module fosscuda/2018b loaded.
Module CMake/3.11.4-GCCcore-7.3.0 unloaded.
Module CMake/3.11.4-GCCcore-7.3.0 loaded.
Module libpng/1.6.34-GCCcore-7.3.0 unloaded.
Module libpng/1.6.34-GCCcore-7.3.0 loaded.
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x20000aa4b8c0 valid_mask = 0x1)
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x3ea978f0 valid_mask = 0x1)
[taurusml11][[36663,1],46][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
[taurusml9][[36663,1],34][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: taurusml9
Local device: mlx5_1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: taurusml11
Local device: mlx5_1
--------------------------------------------------------------------------
srun: error: taurusml11: task 43: Killed
srun: Terminating job step 8425271.1
srun: error: taurusml7: task 19: Killed
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[taurusml7:89102] *** Process received signal ***
[taurusml7:89102] Signal: Aborted (6)
[taurusml7:89102] Signal code: (-6)
slurmstepd: error: *** STEP 8425271.1 ON taurusml3 CANCELLED AT 2019-04-18T02:43:16 ***
ERROR: MPI_AMR method (with brigade strategy): Cannot allocate 2 x 7175894857 bytes for aggregation buffers. An aggregator process needs a buffer to hold one process' output for writing, while it needs another buffer to concurrently receive another process' outpterminate called after throwing an instance of 'std::runtime_error'
what(): ADIOS: error at cmd 'adios_close(threadParams->adiosFileHandle)' (-1, -1) in /scratch/p_electron/2019-02_Bunch-through-foil/picongpu/include/pmacc/../picongpu/plugins/adios/ADIOSWriter.hpp:1556 MPI_AMR method (with brigade strategy): Cannot allocate 2 x 7175894857 bytes for aggregation buffers. An aggregator process needs a buffer to hold one process' output for writing, while it needs another buffer to concurrently receive another process' outp
[taurusml7:89104] *** Process received signal ***
[taurusml7:89104] Signal: Aborted (6)
[taurusml7:89104] Signal code: (-6)
[taurusml7:89102] [ 0] [0x2000000504d8]
[taurusml7:89102] [ 1] /usr/lib64/libc.so.6(abort+0x2b4)[0x200000da1f94]
[taurusml7:89102] [ 2] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x200000b3f7c4]
[taurusml7:89102] [ 3] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(+0xba524)[0x200000b3a524]
[taurusml7:89102] [ 4] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x200000b3a5e0]
[taurusml7:89102] [ 5] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x200000b3aa90]
[taurusml7:89102] [ 6] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_Znwm+0xa4)[0x200000b3b504]
[taurusml7:89102] [ 7] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_Znam+0x18)[0x200000b3b5f8]
[taurusml7:89104] [ 0] [0x2000000504d8]
[taurusml7:89104] [ 1] /usr/lib64/libc.so.6(abort+0x2b4)[0x200000da1f94]
[taurusml7:89104] [ 2] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x200000b3f7c4]
[taurusml7:89104] [ 3] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(+0xba524)[0x200000b3a524]
[taurusml7:89104] [ 4] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x200000b3a5e0]
[taurusml7:89104] [ 5] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x200000b3aa90]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[taurusml7:89104] [ 6] srun: error: taurusml7: task 23: Killed
# more tasks killed by srun
[taurusml7:89102] [ 8] srun: error: taurusml11: tasks 42,44: Killed
srun: error: taurusml6: tasks 12,14: Killed
# more tasks killed by srun
srun: error: taurusml11: task 46: Killed
Now at least ADIOS reports that it needs more memory.
Following your suggestion from yesterday, today I started a simulation using half the GPUs per node by adjusting the V100.tpl to use only 3 GPUs per node but twice the amount of memory per CPU.
It still runs.
I have a few more questions:
First, I do not understand your comment
Also, avoid going wild with threads=10 right away. If I remember correctly, the slurm watchdog at ZIH also checks that you are not overallocating threads compared to the number requested in the job script.
Could you please explain?
Second, is there something in the ADIOS configuration (e.g. aggregators, OSTs, transport params, etc., see above) that I can adjust to reduce ADIOS' memory footprint per node?
Also, the /scratch and /lustre/ssd/ file systems are currently only mounted via NFS on the Taurus ml partition. Is this of importance to my job, apart from a possible I/O speed reduction?
@steindev Do you use aggregators with ADIOS? If so, please remove the aggregation, since the memory on the host is not large enough for this feature.
... since the memory on the host is not large enough for this feature.
:disappointed:
@psychocoderHPC I use --adios.aggregators 0, which writes one file per GPU. I suppose therefore that data aggregation does not really happen. Please give a :+1: if this is what you meant.
Update: Unfortunately, the simulation using half the GPUs per node crashed after about 40 min due to a node failure. It was automatically resubmitted, but the same happened (at least) two more times. I canceled the simulation since I am not sure whether these node failures are caused by my PIConGPU simulation. I am wondering whether somebody else may have received one of the remaining GPUs on the node and the jobs interfered. I added --exclusive to the #SBATCH commands, but now I am receiving cudamemtest errors.
@steindev What is the current status on this issue with the ml partition at taurus?
Did the replies from the support team help?
Update-2
@ax3l Your guess:
Could still be a memory issue.
...
seems to be correct. A simulation using only half of the GPUs per node, i.e. 3 instead of 6, ran without any error message. Therefore, it seems like the
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x20000aa4b8c0 valid_mask = 0x1)
was triggered by too little memory on the host side.
For the record, in order to run with half the GPUs per node, using --exclusive does not help at all. In fact, it actually ignores the other options given to SLURM and launches more tasks per node than wanted.
What actually did the trick was to set --gres=gpu:3 (as expected) and change --mem-per-cpu=... to --mem=0, meaning that PIConGPU is granted all the available memory on a node.
Furthermore, the Taurus admins have by now updated OpenMPI to version 3.1.4. So it may be that future errors due to a lack of host-side memory will present a different error message.
By now the simulation runs using 4 GPUs / node. I have experienced no further problems during I/O so far. I think we can close the issue.