PIConGPU: ADIOS buffer problem

Created on 31 Jul 2018 · 25 comments · Source: ComputationalRadiationPhysics/picongpu

Hi!
I have a problem with buffering with adios/1.13.1.
Here is part of the error message:
[screenshot of the error message]
and path to the files
/bigdata/hplsim/external/mukhar40/ColloidalMelting_TF_I=3.0_1ps.
I tried just recompiling and resubmitting once, but it didn't help.

This simulation was restarting automatically, as discussed in https://github.com/ComputationalRadiationPhysics/picongpu/issues/2618#issuecomment-394273964,
and for some time it worked perfectly, but then this error appeared.

Labels: affects latest release, bug, plugin

All 25 comments

The error message seems to originate from here in the adios code.

from common_adios.c

  if (*total_size > fd->buffer_size && fd->bufstate == buffering_ongoing) {
    if (adios_databuffer_resize(fd, *total_size)) {
      log_warn("Cannot reallocate data buffer to %" PRIu64
               " bytes "
               "for group %s in adios_group_size(). Continue buffering "
               "with buffer size %" PRIu64 " MB\n",
               *total_size, fd->group->name, fd->buffer_size / 1048576L);
    }
  }

The calculation of the adios group buffer size is done around here in PIConGPU.

from ParticleAttributeSize.hpp

...
params->adiosGroupSize += elements * components * sizeof(ComponentType);
...

Any ideas @psychocoderHPC ?

@NastasiaM Do you run the job on our k20 or k80? It looks like the host is running out of CPU memory.

On k20

We probably just miss some attributes in the buffer estimation.
Can you change this line:

https://github.com/ComputationalRadiationPhysics/picongpu/blob/c3054c36ef5cc482233b4779fc9d8dbaed8d1967/src/picongpu/include/plugins/adios/ADIOSWriter.hpp#L1056

to

size_t buffer_mem=static_cast<size_t>(1.2 * static_cast<float_64>(writeBuffer_in_MiB));

?

It looks like the host is running out of CPU memory.

Could also be the case!

I honestly have no experience in this matter, so it may very well be nonsense. But doesn't the provided error message indicate that the buffer size we request is more than the maximum allowed in ADIOS? If so, it could probably be fixed by capping params->adiosGroupSize at 7 MB. It seems there is currently no hard cap on this value.

Don't forget to recompile after the change and replace the old binary in your run directory with the new <paramSet>/bin/picongpu binary if you try @ax3l's solution. (You can also rename the old one)
That is because you're restarting automatically from the same simulation directory.

@sbastrakov that's exactly the lines I am linking, using the var from threadParams->adiosGroupSize.

But it looks to me that the malloc call inside adios_databuffer_resize fails, indicating missing mem on the CPU as @psychocoderHPC stated.
https://github.com/ornladios/ADIOS/blob/v1.13.1/src/core/buffer.c#L58

We could also just comment out the adios_set_max_buffer_size(buffer_mem); - it's a large default value now and optional.

We could also just comment out the adios_set_max_buffer_size(buffer_mem); - it's a large default value now and optional.

yes I think that should work. It was only required in very old ADIOS versions.
@NastasiaM could you please comment out the line @ax3l suggested, recompile, and give us some feedback on whether it works or not.

OK, I commented out this line, compiled, submitted, and now I'm waiting for something to crash or not to crash. I will write to you when I see something.

@NastasiaM how did it work?

Looks like everything is fine now. Thanks!

Ok, will update the mainline then!

Is there any chance that we could crash later again if we, for instance, create a species with a very large number of attributes?

Nope, nothing to worry.

The max size has been sane since ADIOS 1.10, and we are not allocating the buffer explicitly anymore (via adios_groupsize).
Ref: https://github.com/ornladios/ADIOS/commit/9fae0d12fea535a4366e81c0c04d7a3be54c3179

ADIOS re-allocates on adios_write on the fly

Hi!
Now I have another problem: it looks like at some point the simulation output cannot be written, and the simulation is stuck. It should restart after each cycle (about 1 hour), but now it doesn't restart and doesn't give any output. And no error messages either! Any ideas what it could be?

$ df -h /bigdata/hplsim/production/
Filesystem      Size  Used Avail Use% Mounted on
bigdata         725T  680T   46T  94% /bigdata

We wouldn't be running into quota issues, would we?

The soft limit for external users is 80 T, but I can't check because I don't have permission.

$ df -h /bigdata/hplsim/external/
Filesystem      Size  Used Avail Use% Mounted on
bigdata          85T   74T   12T  87% /bigdata

At first I thought ../external used the same space as ../production, but I don't think that's true.
Well, the `quota` command doesn't work. But you can always run `du -hs <your/parent/directory>`. It takes a while, but you can check that way.

Can you qdel the simulation and check stderr/stdout? Sounds to me like running out of memory during the sim.

Done - nothing new in stderr/stdout. Should I send it to you?

And I checked disk usage for myself: 8.2 T out of 10 T, which is my limit.

This issue is solved with #2670.

@NastasiaM I would close this issue because the initial problem is solved. If you have other issues, please open one issue per problem so that we don't mix different topics. It is no problem to work on different issues at the same time.

