PIConGPU: Unclear total GPU memory requirement

Created on 6 Mar 2019  ·  21 Comments  ·  Source: ComputationalRadiationPhysics/picongpu

Recently I ran a 3D simulation of nanofoil ion acceleration on K80 GPUs. It crashed directly after initialization (the h5 file at t=0 was created; I didn't look inside, but it was 71 GB).

The simulation box is 1152 * 3072 * 1152 cells in x, y, z with a cell size of 20 nm * 3 nm * 20 nm. The foil is fully ionized initially with a flat-top profile (thickness 25 nm, density 90 n_c). Three species (carbon, hydrogen, electron) are initialized with 4 macroparticles per cell each. The laser is driven via the current source in fieldBackground.param.

According to the memory calculator, the simulation requires about 270 GB in total, so I use 6 nodes (48 GPUs, 12 GB per GPU) arranged as 6 * 1 * 8 in 3D.

from picongpu.utils import memory_calculator

cell_size_y = 3e-9  # 3 nm
cell_size_x = 20e-9 # 20 nm

y0 = 3.2e-6  # position of foil surface (m)
y1 = y0 + 25.4e-9  # target thickness (m)
L = 0  # pre-plasma scale length (m)
L_cutoff = 0  # pre-plasma length (m)

# number of cells per device
Nx = 1152  # ~30um
Ny = 1024*3    # ~20um
Nz = Nx

vacuum_cells = (y0 - L_cutoff) / cell_size_y  # ~1067 cells of vacuum before the target
target_cells = (y1 - y0 + 2 * L_cutoff) / cell_size_y  # ~8.5 cells across the target

pmc = memory_calculator.MemoryCalculator(Nx, Ny, Nz)

target_x = Nx  # full transversal dimension of the device
target_y = target_cells  # only the first row of devices holds the target
target_z = Nz

# typical number of particles per cell which is multiplied later for
# each species and its relative number of particles
N_PPC = 4

# conversion factor to megabyte
megabyte = 1.0 / (1024 * 1024)

print("Memory requirement per device:")
# field memory per device
field_device = pmc.mem_req_by_fields(Nx, Ny, Nz, field_tmp_slots=2,
                                     particle_shape_order=2)
print("+ fields: {:.2f} MB".format(
      field_device * megabyte))

# electron macroparticles per supercell
e_PPC = N_PPC * (
    # C and H are fully pre-ionized - higher-weighting electrons
    1
    # electrons created from H ionization (none, H is pre-ionized)
    + (0 - 0)
    # electrons created from C ionization (none, C is pre-ionized)
    + (0 - 0)
)
# particle memory per device - only the target area contributes here
e_device = pmc.mem_req_by_particles(
    target_x, target_y, target_z,
    num_additional_attributes=0,
    particles_per_cell=e_PPC
)
H_device = pmc.mem_req_by_particles(
    target_x, target_y, target_z,
    # no bound electrons since H is preionized
    num_additional_attributes=0,
    particles_per_cell=N_PPC
)
C_device = pmc.mem_req_by_particles(
    target_x, target_y, target_z,
    # one additional attribute, e.g. boundElectrons for the carbon ions
    num_additional_attributes=1,
    particles_per_cell=N_PPC
)
#N_device = pmc.mem_req_by_particles(
#    target_x, target_y, target_z,
#    num_additional_attributes=1,
#    particles_per_cell=N_PPC
#)
# memory for calorimeters
#cal_device = pmc.mem_req_by_calorimeter(
#    n_energy=1024, n_yaw=360, n_pitch=1
#) * 2  # electrons and protons
# memory for random number generator states
rng_device = pmc.mem_req_by_rng(Nx, Ny, Nz)

print("+ species:")
print("- e: {:.2f} MB".format(e_device * megabyte))
print("- H: {:.2f} MB".format(H_device * megabyte))
print("- C: {:.2f} MB".format(C_device * megabyte))
#print("- N: {:.2f} MB".format(N_device * megabyte))
print("+ RNG states: {:.2f} MB".format(
    rng_device * megabyte))
#print(
#    "+ particle calorimeters: {:.2f} MB".format(
#        cal_device * megabyte))

mem_sum = (field_device + e_device + H_device + C_device + rng_device)
print("Required memory per device: {:.2f} MB".format(
    mem_sum * megabyte))

The simulation crashed after a long initialization (25 min). Sorry, I accidentally deleted the whole folder, including the stderr file. But I remember the error message was about a CUDA event and "std::runtime_error". It seems like a memory allocation problem. The input can be found here:
/home/wan11/wan11_external/my_antenna_3d_3nm_20nm

Could you help me figure out what the problem is here?

Labels: tools, question

Most helpful comment

Thanks. This issue has been fixed already.


All 21 comments

Dear Yang, welcome to the PIConGPU repository and thank you for your first issue! :tada:

I redid the calculation with the memory calculator.
Please be aware that we recommend using it for a reference GPU with the highest expected memory load, not for the whole simulation box, since the latter can be misleading!
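For reference, here is roughly how the per-device numbers below can be reproduced (a minimal sketch, assuming the same MemoryCalculator API as in your script and your 6 x 1 x 8 decomposition; the species terms are added the same way as in your script, with the target restricted to the devices that actually hold the foil):

from picongpu.utils import memory_calculator

# cells handled by a single device in the 6 x 1 x 8 decomposition
Nx_gpu = 1152 // 6  # 192
Ny_gpu = 3072 // 1  # 3072
Nz_gpu = 1152 // 8  # 144

pmc = memory_calculator.MemoryCalculator(Nx_gpu, Ny_gpu, Nz_gpu)
megabyte = 1.0 / (1024 * 1024)

field_gpu = pmc.mem_req_by_fields(Nx_gpu, Ny_gpu, Nz_gpu,
                                  field_tmp_slots=2,
                                  particle_shape_order=2)
rng_gpu = pmc.mem_req_by_rng(Nx_gpu, Ny_gpu, Nz_gpu)

print("+ fields: {:.2f} MB".format(field_gpu * megabyte))
print("+ RNG states: {:.2f} MB".format(rng_gpu * megabyte))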

Grid per GPU:
  Nx:  192.0
  Ny:  3072
  Nz:  144.0
Memory requirement per device:
+ fields: 4186.24 MB
+ species:
- e: 55.36 MB
- H: 27.68 MB
- C: 31.25 MB
+ RNG states: 1944.00 MB
Required memory per device: 6244.54 MB

I do not see any problems with this amount of used memory.

It is quite unfortunate that you do not have the stderr file anymore.
Once the k80s have a new .tpl file on hemera, could you maybe try running it again, but with reduced resources, by just using one GPU or one node and removing GPUs in the transversal direction?

Could you also please grant us read access to the files in my_antenna_3d_3nm_20nm?

$ chmod -R a+r my_antenna_3d_3nm_20nm

I would like to see the .param files as well as the .cfg file you used.

Hi Marco,
I opened up the permissions so you can read the files. I used 48.cfg for this simulation.

Hm, I still get a Permission denied on input/include/.

P.S.: Could you please add your real name to your GitHub profile? Just so that all of us know whom to connect with the username.

Hey Marco, now we've managed to open the access to /bigdata/hplsim/external/wan11/my_antenna_3d_3nm_20nm -- had to use chmod -R 755 instead of a+r, no idea why...

Just another remark:

Please be careful when reducing the resolution transversally in an overdense case.
At 90 n_c the plasma wavelength is 84.33 nm, and transversally you're resolving that with just over 4 cells.
Also, in order to resolve the plasma oscillation well in time, a typical resolution of omega_p * dt <= 0.1 is recommended.
That would require a time step of 4.477e-18 seconds. This, in turn, would force a much finer spatial resolution, since you need to account for the CFL criterion.
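If you want to double-check those numbers yourself, here is a quick sketch (assuming a laser wavelength of 800 nm for the critical density):

import numpy as np
import scipy.constants as cs

lambda_laser = 0.8e-6  # assumed laser wavelength (m)
omega_laser = 2 * np.pi * cs.c / lambda_laser
n_c = cs.epsilon_0 * cs.m_e * omega_laser**2 / cs.e**2  # critical density

n_e = 90 * n_c  # foil electron density
omega_p = np.sqrt(n_e * cs.e**2 / (cs.epsilon_0 * cs.m_e))  # plasma frequency

lambda_p = 2 * np.pi * cs.c / omega_p  # plasma wavelength, ~84 nm
dt_max = 0.1 / omega_p                 # from omega_p * dt <= 0.1, ~4.5e-18 s

print("plasma wavelength: {:.2f} nm".format(lambda_p * 1e9))
print("max. time step:    {:.3e} s".format(dt_max))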

I know that full 3D simulations of overdense targets require ridiculous amounts of resources. It is nevertheless okay to make simulations with lower resolution as long as you are aware that you might lose dynamics because of it.

@wany12 I just saw that you use a filter filterRNG that, for every output time step, randomly selects a tenth of your particles and writes them into the output.
That seems like a pretty expensive operation and maybe not necessary in this case. If you look at the memory calculation, you can see that your particles take up some 100 MB per GPU, so roughly 4.8 GB of your output is particles. The much larger part is the fields. You could use probe particles to reduce the field output or compress the fields afterwards.

It's admirable that you're aware of the memory that you write :+1:. Don't make it too complicated in the beginning, though.
I have also seen that your *.cfg file doesn't have a lot of synthetic diagnostics, yet.
Try employing more of the calorimeters, energy histograms, phase spaces and so on and set the limits to your expectations of particle energy.

Apart from the last remarks, I do not yet see why your simulation crashed.
As mentioned before I would like to ask you to just rerun it (maybe on one GPU as a quick test) once the new .tpls exist.

Thank you, Marco. I also think we need the stderr file. Regarding the resolution check: I tested in 2D with different cell sizes, and 3 nm * 20 nm seems to be the coarsest resolution that still gives similar results with the minimum possible memory.

Once the new .tpl is created, I will redo the simulation, maybe with one node, and report back to you.

k80.tpl is on the way #2919

Hi, I redid the 3D simulation using K80s on the new cluster hemera. This time I didn't use the particle filter, but all other parameters are the same as before (6 nodes). After initialization, it ran for about 11200 time steps and then crashed because it ran out of memory.

The stderr and stdout can be found in /bigdata/hplsim/external/wan11/my_antenna_3d_3nm20nm/

The main error information is:

terminate called after throwing an instance of 'std::runtime_error'
  what():  /bigdata/hplsim/external/software/picongpu/workshop/thirdParty/alpaka/include/alpaka/event/EventCudaRt.hpp(195) 'ret = cudaEventQuery( event.m_spEventImpl->m_CudaEvent)' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

This is already 80% of your requested number of steps.

My first guess would be that perhaps your plasma behaved in a way that too many particles accumulated on one GPU. Could you please have a look at your outputs (your densities for instance) and see if this might be the case.

Your foil is thin and I would expect it to be blasted away, with particles likely accumulating at the edge of the focal spot region.

You can also have a look at simData_<step>.h5/data/<step>/particles/<species>/particlePatches/ and use the quantities offset and numParticles to find out how many particles of each species were on which GPU.
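Something along these lines should do it (just a sketch with h5py; adjust the step and the species name, H_all below is only an example):

import h5py

step = 11000        # adjust to the time step you want to inspect
species = "H_all"   # adjust to your species name

with h5py.File("simData_{}.h5".format(step), "r") as f:
    patches = f["data/{}/particles/{}/particlePatches".format(step, species)]
    num = patches["numParticles"][...]
    off_x = patches["offset/x"][...]
    off_y = patches["offset/y"][...]
    off_z = patches["offset/z"][...]

# one entry per GPU (patch): where its local domain starts in cells
# and how many macroparticles of this species it currently holds
for i, (x, y, z, n) in enumerate(zip(off_x, off_y, off_z, num)):
    print("patch {:2d}: offset=({:4d}, {:4d}, {:4d})  numParticles={}".format(
        i, x, y, z, n))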

ping @wany12

any news?

Hi Marco, sorry for responding so late. I went back to China last week.
OK, I just checked the last file, and the following is the result (for particles, I only output the protons):

print(f['data/11000/particles/H_all/particlePatches/numParticles'].value)
[885032 890210 895983 893624 895433 895829 890640 885209 891724 901912
869211 863534 861383 868051 899391 891647 903112 869719 867758 887386
888885 867400 876015 892166 903182 867868 870097 884699 882642 872127
871877 893061 893214 899719 862565 863937 863675 861828 897941 892706
885130 890936 898854 893760 891969 897716 890972 885039]

print(f['data/11000/particles/H_all/particlePatches/offset/z'].value)
[ 0 144 288 432 576 720 864 1008 0 144 288 432 576 720
864 1008 0 144 288 432 576 720 864 1008 0 144 288 432
576 720 864 1008 0 144 288 432 576 720 864 1008 0 144
288 432 576 720 864 1008]

print(f['data/11000/particles/H_all/particlePatches/offset/x'].value)
[ 0 0 0 0 0 0 0 0 192 192 192 192 192 192 192 192 384 384
384 384 384 384 384 384 576 576 576 576 576 576 576 576 768 768 768 768
768 768 768 768 960 960 960 960 960 960 960 960]

print(f['data/11000/particles/H_all/particlePatches/offset/y'].value)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0]

Could you tell me how to understand the offset?

Hi Marco, sorry for responding so late. I went back to China last week.

Don't worry :) just wanted to check that you're not stuck with your problem.

The offset shows you the cell coordinate of the origin cell of the GPU with respect to the origin cell of the simulation box.
So you can understand it as follows:

| GPU # | start x | start y | start z |
|-----------|----------|---------|----------|
| GPU0 | 0 | 0 | 0 |
| GPU1 | 0 | 0 | 144 |
| ... ||||
| GPU8 | 192 | 0 | 0 |
| ... ||||
| GPU47 | 960 | 0 | 1008 |

So from the offset variable you get the domain decomposition in your 6 x 1 x 8 = 48 GPU setup.

However, since you only write out the protons we probably won't see why it crashed.
Do you at least also have the electron density?
You could add up all cells for each GPU to get a rough per-GPU "density" value. Provided you didn't work with a pre-plasma (which implies different particle weightings) or with ionization (the weighting of electrons might otherwise differ between the initially free and the newly produced ones), this gives you an idea of where particles accumulated the most, as sketched below.
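A rough sketch of what I mean (the dataset name e_all_chargeDensity and the (z, y, x) storage order are assumptions; check your file with h5ls and the axisLabels attribute first):

import h5py

step = 11000
nx, nz = 1152 // 6, 1152 // 8  # cells per device in x and z (6 x 1 x 8 layout)

with h5py.File("simData_{}.h5".format(step), "r") as f:
    # dataset name is a guess - check your output with h5ls first
    dens = f["data/{}/fields/e_all_chargeDensity".format(step)]
    # assuming the field is stored as (z, y, x); check the axisLabels attribute
    for ix in range(6):
        for iz in range(8):
            block = dens[iz * nz:(iz + 1) * nz, :, ix * nx:(ix + 1) * nx]
            print("device x-offset {:4d}, z-offset {:4d}: sum = {:.3e}".format(
                ix * nx, iz * nz, block.sum()))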

Hi Marco, I added up the density for each GPU, and the initial value is around 3.3e34. At the time step before the crash, the maximum value is 3.8e34. That does not seem like a big difference.

Sorry, I'm not sure if I understand correctly - did you sum up all densities of all GPUs to one number or did you look at each GPU separately and sum it up? The latter is what I would suggest.

In any case - do not hesitate to add some more output and try again.
You can also make use of checkpoints to restart.

It is a bit too little information to say why it crashed.
Just try again, continue with your parameter scan, and do not forget to also add more synthetic output once you learn which energy ranges you are interested in for which particles. :)

Hi Marco, I was at the SPIE conference last week, sorry for responding so late. What I did was for each GPU separately (the latter option).
I will look into the problem later.
Thank you for your help again.

@wany12 and @n01r are there any updates on this issue?
Is it still pending or has it been solved and can be closed?

Thanks. This issue has been fixed already.

