When using the current dev branch to output adios data via the openPMD-api plugin on hemera k80, I ran into the following error:
```
terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
  what():  ERROR: BZ_OUTBUFF_FULL BZIP2 detected size of compressed data is larger than destination length in call to ADIOS2 BZIP2 Compress batch 0: iostream error
```
Details of the call are:
```
--openPMD.period 100
--openPMD.file simOutput
--openPMD.ext bp
--openPMD.json '{
  "adios2": {
    "dataset": {
      "operators": [ { "type": "bzip2" } ] },
    "engine": {
      "type": "file",
      "parameters": { "BufferGrowthFactor": "1.2", "InitialBufferSize": "2GB" } } } }'
```
I will try it without:
```
"dataset": {
  "operators": [ { "type": "bzip2" } ] },
```
This is a "BUG" in ADIOS2. The implementation of the compressor in ADIOS2 differs from ADIOS1.
If you use BZIP2 the internal ADIOS2 buffer will be 10% larger then the chunk to compress
If you have data that can not be well compressed by bzip2 it could be that the compressed data is larger than the internal ADIOS buffer. IN this case, you get this error: https://github.com/ornladios/ADIOS2/blob/3b4e3e04c30b51907386b24a672fbbc752935cea/source/adios2/operator/compress/CompressBZIP2.cpp#L193
Additionally, ADIOS2 is using double-precision operations to calculate the new buffer size which can have large rounding issues if your variable (ADIOS speech for data) per MPI rank is more than 2GiB in size.
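For illustration, here is a minimal sketch of that failure mode using the plain bzlib C API directly (this is not ADIOS2 code; the buffer sizes are made up for the demonstration). Incompressible input makes bzip2's output slightly larger than the input, so a destination buffer that is not larger than the source triggers BZ_OUTBUFF_FULL, the same error code that ADIOS2 surfaces above:
```
// Minimal sketch (not ADIOS2 code): bzip2 reports BZ_OUTBUFF_FULL when the
// compressed output does not fit into the destination buffer. Random data is
// incompressible, so its bzip2 output is slightly larger than the input.
#include <bzlib.h>
#include <cstdlib>
#include <iostream>
#include <vector>

int main()
{
    std::vector<char> source(1 << 20); // 1 MiB of random, incompressible data
    for (auto& byte : source)
        byte = static_cast<char>(std::rand());

    // Destination deliberately no larger than the source; the bzlib docs
    // recommend source size * 1.01 + 600 bytes to guarantee a fit.
    std::vector<char> dest(source.size());
    unsigned int destLen = static_cast<unsigned int>(dest.size());

    int rc = BZ2_bzBuffToBuffCompress(
        dest.data(), &destLen, source.data(),
        static_cast<unsigned int>(source.size()),
        /*blockSize100k=*/9, /*verbosity=*/0, /*workFactor=*/0);

    if (rc == BZ_OUTBUFF_FULL)
        std::cout << "compressed data larger than destination buffer\n";
}
```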
btw: with ADIOS2 2.7.1 blosc compression is fixed. The blosc compressor will detect if the compressed buffer is too large and simply disable compression for the variable and store the data uncompressed.
I will try it without:
"dataset": { "operators": [ { "type": "bzip2" } ] },
yes disabling compression will help in your case
This is a "BUG" in ADIOS2. The implementation of the compressor in ADIOS2 differs from ADIOS1.
If you use BZIP2 the internal ADIOS2 buffer will be 10% larger then the chunk to compress
If you have data that can not be well compressed by bzip2 it could be that the compressed data is larger than the internal ADIOS buffer. IN this case, you get this error: https://github.com/ornladios/ADIOS2/blob/3b4e3e04c30b51907386b24a672fbbc752935cea/source/adios2/operator/compress/CompressBZIP2.cpp#L193
Additionally, ADIOS2 is using double-precision operations to calculate the new buffer size which can have large rounding issues if your variable (ADIOS speech for data) per MPI rank is more than 2GiB in size.btw: with ADIOS2 2.7.1 blosc compression is fixed. The blosc compressor will detect if the compressed buffer is to large and simply disable compression for the variable and store the data uncompressed.
I need to correct myself on the double-precision point: a double is large enough to store sizes precisely up to well over a terabyte.
The solution seems to be to use ADIOS2 2.7.1 with blosc compression in the future, or no compression at all. Can this issue be closed or is there anything more to say?
Actually, yes, I have another question. @psychocoderHPC you wrote
yes disabling compression will help in your case
I started PIConGPU with
```
mpiexec /<PATH-TO>/picongpu -d 2 4 2 -g 608 864 1056 -s 1 --autoAdjustGrid off -m --windowMovePoint 1.0 --openPMD.period 3000 --openPMD.source 'species_all,fields_all' --openPMD.file simOutput --openPMD.ext bp
```
assuming that there is no attempt to compress data if there is no json configuration string.
Now I receive the following error message
```
terminate called after throwing an instance of 'std::_Nested_exception<std::runtime_error>'
  what():  ERROR: buffer overflow when resizing to 36196113975 bytes, in aggregation, when resizing receiving buffer to size 36196113975
```
with a traceback to libopenPMD.so and libadios2_cxx11.so.2.
The full error message is attached.
stderr.txt
What is the source of this error?
Without simulation output, i.e. without the --openPMD options, the simulation runs without errors.
Also, the simulation runs on the fwkt_v100 partition @ HZDR with more than enough host memory.
ERROR: buffer overflow when resizing to 36196113975 bytes, in aggregation, when resizing receiving buffer to size 36196113975
Unluckily, ADIOS2 internally uses a std::vector as its serialization buffer. A std::vector cannot grow in place: it allocates the new storage, copies the old contents over, and only then frees the old block, so both allocations exist at the same time during a resize. If you have 48GiB of main memory for your process and the internal ADIOS2 buffer must grow from 28GiB to 32GiB, the required 28GiB + 32GiB = 60GiB will not fit into main memory. That's a big issue and effectively means you can only use half of your main memory for ADIOS. If I remember correctly, we talked about this issue in the past with the ADIOS devs, but I do not remember what they said.
ADIOS1 was a little different because with ADIOS1 we first calculated the memory requirements in PIConGPU and then called an ADIOS1 function to reserve the correct amount.
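To make the arithmetic explicit, a minimal illustration (not ADIOS2 code; the sizes mirror the example above):
```
// Why growing a std::vector-backed buffer needs old + new memory at once:
// the vector allocates the new block before freeing the old one.
#include <cstddef>
#include <iostream>

int main()
{
    constexpr std::size_t GiB = 1ull << 30;
    std::size_t inUse  = 28 * GiB; // current ADIOS2 serialization buffer
    std::size_t needed = 32 * GiB; // size after the next growth step

    // During the resize both allocations are live while the contents are
    // copied from the old block to the new one.
    std::cout << "peak host memory during resize: "
              << (inUse + needed) / GiB << " GiB\n"; // 60 GiB > 48 GiB available
}
```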
Is it possible to "hack" it with manually setting a buffer size that is sufficient for a given simulation? I see InitialBufferSize that may be for that, but I am not sure
Is it possible to "hack" it with manually setting a buffer size that is sufficient for a given simulation? I see
InitialBufferSizethat may be for that, but I am not sure
Yes, that could be the solution, but it would require counting the required memory before dumping, like we did in the past with ADIOS1. This was not very fast because it requires an emulation step before dumping to find out how many particles and helper fields we need.
I opened an ADIOS2 issue to see what the ADIOS developers suggest and whether my interpretation of the error is correct: https://github.com/ornladios/ADIOS2/issues/2629
But couldn't I just give an initial buffer size that is as large as my (total) GPU memory (on the node)? Does it hurt when the buffer is too large in the beginning?
(Having never used this kind of output in PIConGPU) I think it should be fine as long as you don't run out of host memory for all active buffers combined.
@franzpoeschel Is openPMD-api by default enabling ADIOS2 aggregation instead of writing one file per process?
How can we configure this behavior in openPMD-api?
I looked at the implementation of the openPMD plugin for PIConGPU and could not find anything about it.
Do we need to configure it with --openPMD.json, and if so, what is the JSON string to disable aggregation?
Is it possible to "hack" it with manually setting a buffer size that is sufficient for a given simulation? I see
InitialBufferSizethat may be for that, but I am not sure
I think this will not solve the issue because the problem is coming from the ADIOS aggregation.
Is it possible to "hack" it with manually setting a buffer size that is sufficient for a given simulation? I see
InitialBufferSizethat may be for that, but I am not sureI think this will not solve the issue because the problem is comming from the adios aggregation.
Hmm, it did not solve the issue. Slurm kills the job when I set the initial buffer size to 30GB.
Thanks for the input @ all. With the following openPMD snippet in my simulation's *.cfg, I can at least write the initial simOutput_000000.bp
```
TBG_ADIOS2_configuration="'{
  \"adios2\": {
    \"engine\": {
      \"type\": \"file\",
      \"parameters\": {
        \"BufferGrowthFactor\": \"1.2\",
        \"InitialBufferSize\": \"16GB\",
        \"NumAggregators\": \"!TBG_tasks\"
      }
    }
  }
}'"

TBG_openPMD="--openPMD.period 3000 \
  --openPMD.source 'species_all,fields_all' \
  --openPMD.file simOutput \
  --openPMD.ext bp \
  --openPMD.json !TBG_ADIOS2_configuration"
```
and I moved TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))" to the top of the cfg file.
I will test writing simulation data and checkpoints also at later times in the simulation in the upcoming days.
Using compression, too.
EDIT 2021-02-26: This still did not work completely; scroll down to the bottom to see the working configuration.
and I moved TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))" to the top of the cfg file.
There is no need to move this to the top. With tbg you can use a variable before you define it and assign a value.
Yes, that could be the solution, but it would require counting the required memory before dumping, like we did in the past with ADIOS1. This was not very fast because it requires an emulation step before dumping to find out how many particles and helper fields we need.
I can imagine that some kind of auxiliary backend in openPMD or ADIOS2 would be helpful: one that does no I/O at all but only collects statistics such as the required memory.
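A rough sketch of what such a counting pass could look like (hypothetical code; PlannedChunk and requiredBufferBytes are illustrative names, not an existing openPMD or ADIOS2 API):
```
// Hypothetical "dry run": sum the byte counts of all chunks we are about to
// write and use the total to size the ADIOS2 buffer up front.
#include <cstddef>
#include <numeric>
#include <vector>

struct PlannedChunk
{
    std::size_t elements;        // number of elements in the chunk
    std::size_t bytesPerElement; // size of the stored datatype
};

std::size_t requiredBufferBytes(std::vector<PlannedChunk> const& plannedChunks)
{
    return std::accumulate(
        plannedChunks.begin(), plannedChunks.end(), std::size_t{0},
        [](std::size_t sum, PlannedChunk const& c)
        { return sum + c.elements * c.bytesPerElement; });
}
```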
I tried around a little with the configuration of ADIOS2, but simulations still stop with error messages. Most notably, I receive a buffer overflow / segmentation fault. The error message is
```
Unhandled exception of type 'St17_Nested_exceptionISt13runtime_errorE' with message 'ERROR: buffer overflow when resizing to 23267366457 bytes, when resizing buffer to 23267366457bytes, in call to PerformPuts
', terminating
[cupla] Error: <picongpu/include/pmacc/../pmacc/memory/buffers/HostBufferIntern.hpp>:71
[cupla] Error: <picongpu/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69
.
. # many more messages repeating the last two lines
.
[gv015:67954] *** Process received signal ***
[gv015:67954] Signal: Segmentation fault (11)
[gv015:67954] Signal code: Address not mapped (1)
[gv015:67954] Failing at address: 0x18b9
[gv015:67954] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaab879630]
[gv015:67954] [ 1] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(_ZSt28_Rb_tree_rebalance_for_erasePSt18_Rb_tree_node_baseRS_+0x4aa)[0x2aaaae160c1a]
[gv015:67954] [ 2] 007_16gpus_only-openPMD-plugin_w-Buffer-configuration_w-AggregatorRatio-1_2GB-BufferSize/input/bin/picongpu(_ZN16cupla_cuda_async13cuplaFreeHostEPv+0x80)[0x937930]
[gv015:67954] [ 3] 007_16gpus_only-openPMD-plugin_w-Buffer-configuration_w-AggregatorRatio-1_2GB-BufferSize/input/bin/picongpu(_ZN5pmacc6BufferINS_4math6VectorIfLi1ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3EED2Ev+0x11)[0x7a7181]
[gv015:67954] [ 4] 007_16gpus_only-openPMD-plugin_w-Buffer-configuration_w-AggregatorRatio-1_2GB-BufferSize/input/bin/picongpu(_ZN5pmacc10GridBufferINS_4math6VectorIfLi1ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3ES7_ED1Ev+0x2b7)[0x7a7ac7]
[gv015:67954] [ 5] 007_16gpus_only-openPMD-plugin_w-Buffer-configuration_w-AggregatorRatio-1_2GB-BufferSize/input/bin/picongpu(_ZThn48_N8picongpu8FieldTmpD0Ev+0x68)[0x7a8628]
[gv015:67954] [ 6] 007_16gpus_only-openPMD-plugin_w-Buffer-configuration_w-AggregatorRatio-1_2GB-BufferSize/input/bin/picongpu(_ZN5pmacc13DataConnectorD2Ev+0x152)[0x84fd92]
[gv015:67954] [ 7] /lib64/libc.so.6(+0x39ce9)[0x2aaaae8bace9]
[gv015:67954] [ 8] /lib64/libc.so.6(+0x39d37)[0x2aaaae8bad37]
[gv015:67954] [ 9] /lib64/libc.so.6(__libc_start_main+0xfc)[0x2aaaae8a355c]
[gv015:67954] [10] 007_16gpus_only-openPMD-plugin_w-Buffer-configuration_w-AggregatorRatio-1_2GB-BufferSize/input/bin/picongpu[0x71157f]
[gv015:67954] *** End of error message ***
```
I started the simulation with the following mpi call
```
mpiexec 007_16gpus_only-openPMD-plugin_w-Buffer-configuration_w-AggregatorRatio-1_2GB-BufferSize/input/bin/picongpu -d 2 4 2 -g 608 864 1056 -s 30000 --autoAdjustGrid off -m --windowMovePoint 1.0 --openPMD.period 3000 --openPMD.source 'species_all,fields_all' --openPMD.file simOutput --openPMD.ext bp --openPMD.infix '_%T' --openPMD.json '{ "adios2": { "engine": { "type": "file", "parameters": { "BufferGrowthFactor": "1.1", "InitialBufferSize": "2GB", "AggregatorRatio" : "1" } } } }'
```
using the "AggregatorRatio" configuration option suggested by @pnorbert in ornladios/ADIOS2#2629.
I originally expected that the simulation was killed by the job system due to running out of memory, as suggested by the message
```
--------------------------------------------------------------------------
mpiexec noticed that process rank 7 with PID 99627 on node gv012 exited on signal 9 (Killed).
--------------------------------------------------------------------------
```
in stdout.
But asking the admin revealed that this is not the case: the SLURM logs give no reason, nor suggest one, for killing the job.
At the moment, I only know from the timing that it happened while writing simulation output. And I doubt that this is related to an out-of-memory problem. A node has 378GB of host memory available. The four V100s in a node have 32GB of memory each. That is, even if all data from all GPUs within a node is copied to host memory (128GB), more than twice that amount remains free on the host. Also, with the above AggregatorRatio setting, aggregation of data from GPUs located on other nodes should not take place. However, in contrast to the previous error message, this message does not mention aggregation anyway.
Can anyone help?
Wildly guessing, I ask myself whether this is a problem of a wrong integer type when requesting memory...
@steindev For memory usage, I think you should also consult the different data preparation strategies that the openPMD plugin uses; see here for the two strategies that are available (search for dataPreparationStrategy). For the strategy doubleBuffer:
This strategy requires at least 2x the GPU main memory on the host side.
Together with the serialization buffer that ADIOS2 uses, that is 3x the GPU main memory, which might actually exhaust the available main memory on such a node (4 GPUs × 32GB × 3 = 384GB, slightly more than the 378GB you quoted). You could try --openPMD.dataPreparationStrategy mappedMemory, at the cost of more computation time, to reduce memory usage.
Did you set InitialBufferSize as an ADIOS parameter?
@steindev You set "InitialBufferSize": "2GB" in your tests; from your error message you can see that this is far too small. In that case you trigger the same slow resizing of the internal buffer and hit the 50% issue in ADIOS2.
Please set "InitialBufferSize": "45GB" to avoid any internal buffer resize in ADIOS2.
I tried this earlier and got killed by the batch system. Therefore I reduced it again, which let my simulation run a little longer, probably until there were more particles due to increasing density.
However, with the hint given by @franzpoeschel to use --openPMD.dataPreparationStrategy mappedMemory in combination with "InitialBufferSize": "45GB", the simulation runs longer than ever before and manages to write simulation data also at the highest gas density :tada:
I will write if it continues and post the final configuration :monocle_face:
Take care: --openPMD.dataPreparationStrategy mappedMemory can make your I/O very slow. I am not sure whether this is also true for V100 with NVLink.
I tried this earlier and got killed by the batch system. Therefore I reduced it again, which let my simulation run a little longer, probably until there were more particles due to increasing density.
If you get killed by the batch system because you set the initial buffer to 45GiB, then it could be that your tpl is wrong and the way we allocate the memory is not correct.
@steindev Which system do you use? I am now a little confused because the original issue was about k80 but you reported something about V100.
Indeed, I am running all the time on V100 at hemera. @PrometheusPi, who opened the issue, ran on k80.
Take care: --openPMD.dataPreparationStrategy mappedMemory can make your I/O very slow. I am not sure whether this is also true for V100 with NVLink.
Thanks for the warning! For the simulation I ran now, using only 16 GPUs, I could not observe a slowdown. On the contrary, output times were reduced by a factor of 2 with respect to outputs written with --openPMD.dataPreparationStrategy doubleBuffer (for the first 4 outputs, where writing with doubleBuffer was still possible). The bp folder size really is equal, I checked.
For future reference: Final configuration in my job's TBG-*.cfg file that worked
```
# Total memory of single gpu + reserve, to define ADIOS2 initial buffer size
TBG_initialBufferSize="45GB" # Nvidia Tesla V100 SXM2 32GB = 32510MiB

TBG_ADIOS2_configuration="'{ \
  \"adios2\": { \
    \"engine\": { \
      \"type\": \"file\", \
      \"parameters\": { \
        \"BufferGrowthFactor\": \"1.1\", \
        \"InitialBufferSize\": \"!TBG_initialBufferSize\", \
        \"AggregatorRatio\" : \"1\" \
      } \
    } \
  } \
}'"

TBG_openPMD="--openPMD.period 3000 \
  --openPMD.source 'species_all,fields_all' \
  --openPMD.file simOutput \
  --openPMD.ext bp \
  --openPMD.infix '_%T' \
  --openPMD.dataPreparationStrategy mappedMemory \
  --openPMD.json !TBG_ADIOS2_configuration"

TBG_plugins="!TBG_openPMD" #add your other plugins too!
```
On the contrary, output times reduced by a factor of 2 with respect to outputs written with --openPMD.dataPreparationStrategy doubleBuffer
I assume that this is due to now giving the correct buffer size of 45GB. Using doubleBuffer would otherwise be faster, but the slowdown through resizing outweighs that.
Yes, you are right. The initial buffer size in the older simulation was 2GB.
In principle the issue is solved. But there are two remaining questions:
(i) Yes, that could be the solution, but it would require counting the required memory before dumping, like we did in the past with ADIOS1. This was not very fast because it requires an emulation step before dumping to find out how many particles and helper fields we need.
I can imagine that some kind of auxiliary backend in openPMD or ADIOS2 would be helpful: one that does no I/O at all but only collects statistics such as the required memory.
(ii) I tried this earlier and got killed by the batch system. Therefore I reduced it again, which let my simulation run a little longer, probably until there were more particles due to increasing density.
If you get killed by the batch system because you set the initial buffer to 45GiB, then it could be that your tpl is wrong and the way we allocate the memory is not correct.
I think (ii) is explained by the fact that I used double buffering, which, together with ADIOS' serialization buffer, required more memory for all GPUs in a node than the node's memory.
What about (i)? Should I open a new issue for that?
You can do that if you like, yeah. I think there should also be some discussion on our side on (1) whether this is something that we would use and (2) how to integrate it into our workflows (so that you wouldn't need to use such a thing manually).
I think I'll also bring up the topic in the weekly VC that I have with the ADIOS2 developers and Axel Hübl's team.
Issue opened in openPMD/openPMD-api#937. Closing this.
@franzpoeschel @steindev We should think about flushing openPMD series explicitly while we output the data. We can track how much data we hand to the openPMD API and call flush() once we reach a useful threshold, e.g. a few gigabytes (a sketch follows below).
@franzpoeschel Is it possible to flush the series without breaking the streaming? The flush should not mark that a simulation timestep is over; it should only push data out so that the memory footprint does not grow too much.
Since we can create unlimited derived data from our simulation data, we need such a mechanism anyway, because currently we assume that we can somehow hold all data we will dump in main memory.
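A small sketch of that bookkeeping (hypothetical; ThresholdFlusher and notifyStored are illustrative names, not existing PIConGPU or openPMD-api code):
```
// Track the bytes handed to openPMD-api and flush once a threshold is hit,
// so the backend's staging buffers are drained before they grow too large.
#include <openPMD/openPMD.hpp>
#include <cstddef>

class ThresholdFlusher
{
    openPMD::Series& series;
    std::size_t pendingBytes = 0;
    std::size_t const thresholdBytes;

public:
    ThresholdFlusher(openPMD::Series& s, std::size_t threshold)
        : series(s), thresholdBytes(threshold)
    {
    }

    // Call after every storeChunk() with the chunk's size in bytes.
    void notifyStored(std::size_t bytes)
    {
        pendingBytes += bytes;
        if (pendingBytes >= thresholdBytes)
        {
            series.flush(); // hands buffers to the backend, does not end the step
            pendingBytes = 0;
        }
    }
};
```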
Intermittent flushes are already happening. A streaming step is ended in openPMD by Iteration::close(); you can flush however often you like. Our current flushing strategy is guided by our reuse of PIConGPU buffers (i.e. we flush very often so we can reuse buffers, but flushing often is inefficient in ADIOS2).
@franzpoeschel As I understand the flush in openPMD-api, it only guarantees that the pointers passed to datasets can be used again. This means ADIOS is free to keep all data in a temporary buffer until we destroy the ADIOS writer object.
Based on the error @steindev reported, I assume that ADIOS does not write data to disk when we call flush.
Is my understanding correct that openPMD-api calls ADIOS2's engine.Close() only when Iteration::close() is called in openPMD-api? In that case this would explain the behavior we saw.
@psychocoderHPC There are two ways to make ADIOS2 dump data from main memory to disk:
1. Closing the engine; Iteration::close() will do that.
2. Ending a step; Iteration::close() will do that in streaming mode.
Other than that, we can only make ADIOS2 consume the buffers that we give it, but not instruct ADIOS2 to put that data on disk already. That is what happens upon Series::flush(). So, yes, your understanding is correct.
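A minimal write sketch of that distinction (assuming the openPMD-api C++ interface with the ADIOS2 backend; the dataset name and sizes are made up):
```
// Series::flush() only hands buffers over to ADIOS2; Iteration::close() ends
// the step / closes the engine so that the data is guaranteed to reach disk.
#include <openPMD/openPMD.hpp>
#include <vector>

int main()
{
    using namespace openPMD;
    Series series("simOutput_%T.bp", Access::CREATE);

    auto iteration = series.iterations[0];
    auto E_x = iteration.meshes["E"]["x"];
    E_x.resetDataset(Dataset(determineDatatype<float>(), {1000}));

    std::vector<float> data(1000, 0.0f);
    E_x.storeChunk(data, {0}, {1000});

    series.flush();    // buffers consumed by ADIOS2, data not necessarily on disk
    iteration.close(); // ends the step / closes the engine, data reaches disk
}
```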
I will open an ADIOS issue tomorrow to ask if there is any other way to write data to disk. In the worst case we need to close and reopen the engine.
Think about a system, e.g. Taurus at TU Dresden, where each node has only twice the GPUs' memory or less. Somehow openPMD should provide a way to write data larger than the main memory of the node.
Can I close an openPMD-api series and open it again to write additional data?
close and open the engine
Not really an ADIOS feature. You can append new steps to an existing ADIOS file, but you can't modify an existing one.
Can I close an openPMD-api series and open it again to write additional data?
Only in HDF5 (and JSON I think).
I will open an ADIOS issue tomorrow
Relevant issue for that: https://github.com/ornladios/ADIOS2/issues/1891#issuecomment-783523912.
Short version: ADIOS2 has a method Engine::Flush():
```
/**
 * Manually flush to underlying transport to guarantee data is moved
 * @param transportIndex
 */
void Flush(const int transportIndex = -1);
```
… but it behaves weirdly and we don't use it.
Long version: Read the comment.
Realistically, the best solution would be to use a more flexible data structure than a single vector<char> in ADIOS, but I don't think that will happen.
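For illustration, a minimal sketch of such a structure (hypothetical, not ADIOS2 code): a buffer that grows in fixed-size blocks, so growing never copies existing data and peak memory stays at the used size plus one block instead of old plus new:
```
// A chunked write buffer: appending allocates one new block at a time instead
// of reallocating a single contiguous std::vector<char>.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <list>
#include <vector>

class ChunkedBuffer
{
    static constexpr std::size_t blockSize = 64 * 1024 * 1024; // 64 MiB blocks
    std::list<std::vector<char>> blocks;
    std::size_t usedInLastBlock = 0;

public:
    void append(char const* data, std::size_t n)
    {
        while (n > 0)
        {
            if (blocks.empty() || usedInLastBlock == blockSize)
            {
                blocks.emplace_back(blockSize); // grow without copying old data
                usedInLastBlock = 0;
            }
            std::size_t const take = std::min(n, blockSize - usedInLastBlock);
            std::memcpy(blocks.back().data() + usedInLastBlock, data, take);
            usedInLastBlock += take;
            data += take;
            n -= take;
        }
    }
};
```
The trade-off is that the buffered data is no longer contiguous, so the transport layer would have to support writing from multiple blocks.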
This is a "BUG" in ADIOS2. The implementation of the compressor in ADIOS2 differs from ADIOS1.
If you use BZIP2 the internal ADIOS2 buffer will be 10% larger then the chunk to compress
If you have data that can not be well compressed by bzip2 it could be that the compressed data is larger than the internal ADIOS buffer. IN this case, you get this error: https://github.com/ornladios/ADIOS2/blob/3b4e3e04c30b51907386b24a672fbbc752935cea/source/adios2/operator/compress/CompressBZIP2.cpp#L193
Additionally, ADIOS2 is using double-precision operations to calculate the new buffer size which can have large rounding issues if your variable (ADIOS speech for data) per MPI rank is more than 2GiB in size.btw: with ADIOS2 2.7.1 blosc compression is fixed. The blosc compressor will detect if the compressed buffer is to large and simply disable compression for the variable and store the data uncompressed.
@psychocoderHPC What is the correct syntax to use this fix with openPMD JSON? @ax3l got openPMD 0.13 with ADIOS2 2.7.1 ready on Summit, so I can get some I/O profiles with compression on.
This should be the string you pass to openPMD-api to enable blosc usage:
```
{
  "adios2": {
    "engine": {
      "usesteps": true,
      "parameters": {
        "InitialBufferSize": "2Gb",
        "Profile": "On"
      }
    },
    "dataset": {
      "operators": [
        {
          "type": "blosc",
          "parameters": {
            "clevel": "1",
            "doshuffle": "BLOSC_BITSHUFFLE"
          }
        }
      ]
    }
  }
}
```
ADIOS2 currently does not have a documentation section for the possible options of an operator; therefore you should inspect the code here.
The JSON string should be passed to openPMD-api during series creation; see the example in PIConGPU.
@franzpoeschel Is there an openPMD-api example available on how to pass the JSON to the openPMD-api backend? Could you please link the example?
That part is documented here @psychocoderHPC @benjha.
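For completeness, a minimal sketch of passing such a configuration at series creation (assuming openPMD-api >= 0.13, where the Series constructor accepts a JSON options string; the file name is made up):
```
// The backend configuration is passed as a JSON string via the third argument
// of the Series constructor.
#include <openPMD/openPMD.hpp>
#include <string>

int main()
{
    std::string const config = R"({
        "adios2": {
            "dataset": {
                "operators": [
                    { "type": "blosc",
                      "parameters": { "clevel": "1", "doshuffle": "BLOSC_BITSHUFFLE" } }
                ]
            }
        }
    })";
    openPMD::Series series("simOutput_%T.bp", openPMD::Access::CREATE, config);
}
```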