PIConGPU: .tpl and picongpu.profile on a single node

Created on 7 Oct 2020 · 21 comments · Source: ComputationalRadiationPhysics/picongpu

Hi,

I recently got access to a single-node, multi-GPU system. How can I set up these files? For example, the system has:

# COMPUTE NODES
NodeName=thor Gres=gpu:16 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=1546812
......
NodeName=thor Name=gpu File=/dev/nvidia0 CPUs=0-11,24-35
NodeName=thor Name=gpu File=/dev/nvidia1 CPUs=0-11,24-35
...
NodeName=thor Name=gpu File=/dev/nvidia8 CPUs=12-23,36-47
NodeName=thor Name=gpu File=/dev/nvidia9 CPUs=12-23,36-47
...
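
(For reference, the same node and GRES information can be queried directly from SLURM:)

scontrol show node thor               # node definition: Gres=gpu:16, CPU and memory counts
scontrol show config | grep -i gres   # GRES-related settings from slurm.conf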

I am preparing to use NodeName=thor. I took fwkt_v100.tpl as an example; in particular, I am confused about these lines:

# number of available/hosted GPUs per node in the system
.TBG_numHostedGPUPerNode=4

# required GPUs per node for the current job
.TBG_gpusPerNode=`if [ $TBG_tasks -gt $TBG_numHostedGPUPerNode ] ; then echo $TBG_numHostedGPUPerNode; else echo $TBG_tasks; fi`

# host memory per gpu
.TBG_memPerGPU="$((378000 / $TBG_gpusPerNode))"
# host memory per node
.TBG_memPerNode="$((TBG_memPerGPU * TBG_gpusPerNode))"

# number of cores to block per GPU - we got 6 cpus per gpu
#   and we will be accounted 6 CPUs per GPU anyway
.TBG_coresPerGPU=6

# We only start 1 MPI task per GPU
.TBG_mpiTasksPerNode="$(( TBG_gpusPerNode * 1 ))"

# use ceil to calculate nodes
.TBG_nodes="$((( TBG_tasks + TBG_gpusPerNode - 1 ) / TBG_gpusPerNode))"
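
(With integer arithmetic this rounds up: e.g. TBG_tasks=20 on nodes hosting 16 GPUs gives (20 + 16 - 1) / 16 = 35 / 16 = 2 nodes.)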

I am not sure how to set .TBG_coresPerGPU, since the CPUs are shared among the GPUs, nor how to get the correct number for .TBG_memPerGPU.

On the other hand, there are two lines in the .profile with
srun --time=0:30:00 --nodes=$numNodes --ntasks-per-node=48 --cpus-per-task=1 --gres=gpu:16 --mem=32000 -p gpu -A gpu --pty bash

I would be glad if someone could show how this can be set. Thanks

Labels: cuda, machine/system, question


All 21 comments

Hello @StevE-Ong ,

Both of these variables are used to account for memory and core-hour usage in case you use only part of a node rather than the full node: the value is specified per GPU and then multiplied by the number of GPUs requested. This is of course system-dependent, e.g. some systems count like that, and some always charge for the full node. (Such information is normally given in the system documentation, or by a system admin.) In the latter case one always allocates the full node, so you can simply set .TBG_memPerNode directly to the correct value (not as a product of memory per GPU and number of GPUs) and remove or ignore .TBG_memPerGPU.

Regarding .TBG_coresPerGPU, it is used as the SLURM --cpus-per-task value (in SLURM terminology, tasks are MPI processes). PIConGPU should work with 1 as well; that profile sets 6 because the system in question accounts for 6 CPUs per GPU anyway.

With these settings you more or less end up with the srun command parameters at the end of your post.
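
As a rough sketch for your thor node (16 hosted GPUs, 48 cores, RealMemory=1546812 MB taken from the node definition above; how much memory you may actually request depends on the system, so treat the values as a starting point, not a definitive configuration):

# number of available/hosted GPUs per node in the system
.TBG_numHostedGPUPerNode=16

# 48 cores shared evenly among 16 GPUs
.TBG_coresPerGPU=3

# host memory per GPU: RealMemory spread over 16 GPUs,
# rounded down to leave some room for the OS
.TBG_memPerGPU="$((1500000 / $TBG_numHostedGPUPerNode))"
# host memory per node
.TBG_memPerNode="$((TBG_memPerGPU * TBG_gpusPerNode))"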

@sbastrakov Thank you for your reply. I tried the following:

# number of available/hosted GPUs per node in the system
.TBG_numHostedGPUPerNode=16

# required GPUs per node for the current job
.TBG_gpusPerNode=`if [ $TBG_tasks -gt $TBG_numHostedGPUPerNode ] ; then echo $TBG_numHostedGPUPerNode; else echo $TBG_tasks; fi`

# host memory per node
.TBG_memPerNode=1500000

.TBG_coresPerGPU=3 ## <--- 48 cpus in total and shared by 16 gpus

# We only start 1 MPI task per GPU
.TBG_mpiTasksPerNode="$(( TBG_gpusPerNode * 1 ))"

# use ceil to calculate nodes
.TBG_nodes=1

I then ran the LWFA example with 4.cfg and it returned:

no binary 'cuda_memtest' available or compute node is not exclusively allocated, skip GPU memory test
/var/spool/slurm/d/job00127/slurm_script: line 99: mpiexec: command not found

So I looked at the last few lines of the .tpl file, which say:

# test if cuda_memtest binary is available and we have the node exclusive
if [ -f !TBG_dstPath/input/bin/cuda_memtest ] && [ !TBG_numHostedGPUPerNode -eq !TBG_gpusPerNode ] ; then
  # Run CUDA memtest to check GPU's health
  mpiexec !TBG_dstPath/input/bin/cuda_memtest.sh
else
  echo "no binary 'cuda_memtest' available or compute node is not exclusively allocated, skip GPU memory test" >&2
fi

It seems that !TBG_numHostedGPUPerNode -eq !TBG_gpusPerNode is necessary. So I used 16.cfg (i.e., all the GPUs) and it returned the attached
stderr.txt

I guess this time it may have to do with the memory being set incorrectly, OR the compilation itself has a problem? I used spack to install the dependencies, and I can compile the LWFA example after deleting the #include <pmacc/meta/XXX> lines in speciesDefinition.param.

Hello @StevE-Ong .

This output

no binary 'cuda_memtest' available or compute node is not exclusively allocated, skip GPU memory test
/var/spool/slurm/d/job00127/slurm_script: line 99: mpiexec: command not found

does not concern PIConGPU itself. It is just a sanity check that we run beforehand when a user runs on a fully allocated node, so you can simply ignore it. Btw, we improved the clarity of this message in the dev branch so that it is not mistaken for a PIConGPU error.

The fact that you needed to remove something from the example to compile suggests that your spack sources and your example setup come from different versions of PIConGPU. This is very fragile and could lead to various issues. Could you please double-check that the same version is used?

Regarding the attached stderr, it seems there is a mismatch between versions. This normally originates from having a dependency via spack and another version of the same dependency (probably CUDA in this case) via a system module. To investigate this, you could manually check for it and/or attach your full .profile file used, and the output of

.build/picongpu -h
.build/picongpu -v

when run from the setup directory after PIConGPU was built there (i.e., after pic-build completed).

Thanks @sbastrakov

The version of PIConGPU is 0.4.3, loaded from spack.

Package:   picongpu

Description:
    PIConGPU: A particle-in-cell code for GPGPUs

Homepage: https://github.com/ComputationalRadiationPhysics/picongpu

Maintainers: @ax3l

Tags: 
    None

Preferred version:  
    0.4.3        https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.3.tar.gz

Safe versions:  
    develop      [git] https://github.com/ComputationalRadiationPhysics/picongpu.git on branch dev
    0.4.3        https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.3.tar.gz
    0.4.2        https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.2.tar.gz
    0.4.1        https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.1.tar.gz
    0.4.0-rc4    https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.0-rc4.tar.gz
    0.4.0-rc3    https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.0-rc3.tar.gz
    0.4.0-rc2    https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.0-rc2.tar.gz
    0.4.0        https://github.com/ComputationalRadiationPhysics/picongpu/archive/0.4.0.tar.gz
    local        [git] file:///data/storage/jfong/src/picongpu
    gtc18        [git] https://github.com/ax3l/picongpu.git on branch topic-NGCandGTC18
    foilISAAC    [git] https://github.com/ax3l/picongpu.git on branch topic-20171114-foilISAAC

Variants:
    Name [Default]    Allowed values    Description
    ==============    ==============    ======================================

    adios [off]       on, off           Enable the ADIOS plugin
    backend [cuda]    cuda, omp2b       Control the computing backend
    cudacxx [nvcc]    nvcc, clang       Device compiler for the CUDA backend
    hdf5 [on]         on, off           Enable multiple plugins requiring HDF5
    isaac [off]       on, off           Enable the ISAAC plugin
    png [on]          on, off           Enable the PNG plugin

Installation Phases:
    install

Build Dependencies:
    adios  boost  cmake  cuda  isaac  libsplash  pngwriter  zlib

Link Dependencies:
    adios  boost  cuda  isaac  libsplash  openmpi  pngwriter  zlib

Run Dependencies:
    cmake  isaac-server  openmpi  rsync  util-linux

Virtual Packages: 
    None

But the examples are from the files I git-cloned from the GitHub link, because I am not sure how to copy the examples loaded from spack. The spack version is 0.15.4-926-9c8cfcca0.

The compile.output, the outputs of .build/picongpu -v and .build/picongpu -h, and the .profile are attached. pic-build was successful.
compile.output.txt
build_picongpu _h.txt
build_picongpu _v.txt
picongpu.profile.txt

After you have done source $HOME/src/spack/share/spack/setup-env.sh (adjusted to your spack directory) and spack load picongpu, all the normal PIConGPU environment variables will be set, as defined in other .profile files and used in the documentation. In particular, $PICSRC points to the main PIConGPU source directory, and $PIC_EXAMPLES to its internal share/picongpu/examples. So things like that should work.
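
For example, a quick check after loading (pic-create and the LaserWakefield example path are the standard ones from the documentation):

# activate spack (adjust the path to your spack checkout) and load PIConGPU
source $HOME/src/spack/share/spack/setup-env.sh
spack load picongpu

# the usual PIConGPU variables should now be set
echo $PICSRC
ls $PIC_EXAMPLES

# e.g. clone the LWFA example into a new input set
pic-create $PIC_EXAMPLES/LaserWakefield $HOME/picInputs/lwfa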

I have not yet taken a look at your output files, will do after lunch.

Btw, we now have a spack package for the latest PIConGPU release, 0.5.0, as well; it is available once you update (fetch and pull) https://github.com/ComputationalRadiationPhysics/spack-repo

> After you have done source $HOME/src/spack/share/spack/setup-env.sh (adjusted to your spack directory) and spack load picongpu, all the normal PIConGPU environment variables will be set, as defined in other .profile files and used in the documentation. In particular, $PICSRC points to the main PIConGPU source directory, and $PIC_EXAMPLES to its internal share/picongpu/examples. So things like that should work.
>
> I have not yet taken a look at your output files, will do after lunch.

I had been wondering what spack load picongpu did, and I was looking around to find where picongpu is installed. Now I see it also defines $PIC_EXAMPLES and others. So does this mean we do not need a .profile if we use spack?

I think generally the idea is that one does not need a custom .profile file. But one does still need a .tpl file, which your previous questions concerned.

Sorry for the late response regarding the crash, and thanks for the provided info. It seems to me that the code was built with one CUDA version and ran with another loaded in the environment, which caused a version mismatch and a crash. To investigate, could you check your environment on the node just before starting PIConGPU, by checking the loaded modules (maybe CUDA is loaded automatically on the system?) and running nvidia-smi on the GPU node?

@sbastrakov Please take your time. I tried running nvidia-smi, but it gives Failed to initialize NVML: Driver/library version mismatch. Some online searches suggest a reboot could help, so I think only the system administrator can solve this for now.

Seems so. My basic idea is to make sure that, with the PIConGPU spack environment activated (spack load), nvidia-smi reports the same versions as in the corresponding profile, i.e., the same ones that will be used by pic-build. This should remove the "Failed to initialize..." error you are getting, so things would at least move further along.
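
A sketch of such a check on the GPU node (assuming spack find --loaded and environment modules are available there):

# CUDA version supported by the driver, as seen on the node
nvidia-smi

# CUDA toolkit version visible in the current environment
nvcc --version

# which CUDA spack has loaded, and whether a system module is also active
spack find --loaded cuda
module list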

Hi, nvidia-smi is fixed now and reports:

| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   37C    P0    51W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
....

spack find returns:

==> 37 installed packages
-- linux-ubuntu18.04-skylake_avx512 / gcc@7.5.0 -----------------
(the 37 package@version entries are not recoverable; the archived page obfuscated them as email addresses)

The errors are in the attached stderr.txt.
build_picongpu _v.txt

This is running with 4.cfg from the LWFA example. When running with 16.cfg it gives:

[10/12/2020 15:02:59][thor][24]:ERROR: CUDA error: invalid device ordinal, line 131, file /data/storage/jfong/src/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-7.5.0/picongpu-0.5.0-k5yhtwhbmmh6tw2f3egu4ms4z4d3ayda/thirdParty/cuda_memtest/cuda_memtest.cu

Does this have something to do with the CUDA version, or is the configuration in the .tpl not set properly?

Hello @StevE-Ong ,

From the data provided, I am not sure whether the issue is with the .tpl or something else. To help investigate, please attach the archived tbg subdirectory inside the output directory (the one you set as the last parameter when launching with tbg) for both the 4.cfg and 16.cfg runs. (These subdirectories are created right after the tbg command, before PIConGPU or cuda_memtest starts, and contain the more technical run parameters.)
Also, are you now using the "standard" spack .profile or your custom one?

Hi @sbastrakov.

I uploaded the outputs of two runs: lwfa0000 was run with 4.cfg, while lwfa0001 was run with 16.cfg. The gpu.tpl and picongpu.profile are also included. During the run, I did not do source picongpu.profile; I only did spack load picongpu@0.5.0.

picongpu_test.zip

Thanks for the provided info, @StevE-Ong.
Only now did I realize that I missed one thing from the very start, sorry: does the system use SLURM at all, or are users supposed to start tasks with mpiexec directly? The .tpl you used is for SLURM only; in the second scenario you would need to pass the options directly to mpiexec. If this is the case, it would also explain the errors you are getting. We can of course adjust everything for this case; it would not be a big problem.
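
In the mpiexec-only scenario the launch would look roughly like this (a sketch with hypothetical device-grid and step numbers; the real options come from your .cfg file):

# one MPI rank per GPU, started directly with OpenMPI
# -d: device grid (4*2*2 = 16 GPUs), -g: global cells, -s: time steps
mpiexec -np 16 ./input/bin/picongpu -d 4 2 2 -g 256 256 256 -s 4000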

Hi @sbastrakov, I am using SLURM.

A quick update! I think one setting was not right.

I uncommented SBATCH --cpus-per-task=!TBG_coresPerGPU in gpu.tpl, and it now runs. The stderr file contains

==> Error: Spec 'picongpu@0.5.0%gcc@7.5.0~adios+hdf5~isaac+png backend=cuda cudacxx=nvcc arch=linux-ubuntu18.04-skylake_avx512 ^...@...%gcc@7.5.0 arch=linux-ubuntu18.04-skylake_avx512 ^...@...%gcc@7.5.0 arch=linux-ubuntu18.04-skylake_avx512 ^...@...%gcc@7.5.0...........

But the stderr gives

Running program...
using default compiler
PIConGPU: 0.5.0
  Build-Type: Release

Third party:
  OS:         Linux-5.3.0-51-generic
  arch:       x86_64
  CXX:        GNU (7.5.0)
  CMake:      3.18.2
  CUDA:       10.2.89
  mallocMC:   2.3.1
  Boost:      1.70.0
  MPI:        
    standard: 3.1
    flavor:   OpenMPI (3.1.6)
  PNGwriter:  0.7.0
  libSplash:  1.7.0 (Format 4.0)
  ADIOS:      NOTFOUND
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
   Estimates are based on DensityRatio to BASE_DENSITY of each species
   (see: density.param, speciesDefinition.param).
   It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 2097152
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
initialization time:  1min 10sec 307msec = 70 sec
  0 % =        0 | time elapsed:                   24msec | avg time per step:   0msec
  5 % =      200 | time elapsed:             2sec 751msec | avg time per step:  13msec
......
100 % =     4000 | time elapsed:       1min 29sec 619msec | avg time per step:  23msec
calculation  simulation time:  1min 29sec 666msec = 89 sec
full simulation time:  2min 46sec 136msec = 166 sec

I get some output in simOutput. I have not tried other output such as HDF5 files yet.
To be sure, I will continue to run a few more times in the next few days and write an update here. Thanks again, @sbastrakov.

Perhaps there is some issue with setting up the SLURM variables. Note that these are partly redundant, and on some machines there are limitations on which subsets of variables can be used; we had to modify some profiles to use the correct subset. Perhaps the documentation of that machine has some examples or recommendations on that? In the worst case, you could go through a trial-and-error process to determine the right subset; PIConGPU is not needed for that, just any MPI HelloWorld-type application, as sketched below.
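
For instance (mpi_hello here stands for any compiled MPI HelloWorld test binary, and the option subset shown is just one candidate):

# does this subset of options give the expected allocation?
srun --ntasks=16 --cpus-per-task=1 --gres=gpu:16 hostname

# then the same subset with an actual MPI program
srun --ntasks=16 --cpus-per-task=1 --gres=gpu:16 ./mpi_hello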

Hi @sbastrakov, you are right. I have tested several configurations (gpu.tpl and mpiexec.tpl), and only 16 CPUs were allowed for 16 GPUs; that is, .TBG_coresPerGPU has to be 1. So far I have no idea why it is like that; it probably has something to do with the SLURM settings, as you mentioned. Unfortunately, I did not set up the machine, but I will keep looking for solutions.

Other things such as HDF5 and PNGwriter output are working fine. The one issue is that the file ~/simOutput/output is not produced, and the Python script using eh_data = EnergyHistogramData(rf"{run_dir}") returns an error saying that the file ~/simOutput/output is not found.

Version 0.5.0 is now capable of adjusting the grid size to a multiple of the supercell size, which is great and saves a lot of time in looking for the correct grid numbers.

I have another question that is not related to this topic: is version 0.5.0 capable of outputting time-averaged fields? Should I open a separate issue for this? Thanks.

Yes, I think these things are just configured by the system admins (and the system itself may impose some limitations there as well). For PIConGPU it should generally not be a big deal to have one core per GPU: all computational kernels run on the GPU, and we only use CPU cores intensively for some types of output. So I don't think this is a severe limitation for PIConGPU. However, if you want to solve it, I don't think it is possible at the user level, only via the system admins.

simOutput/output is a symbolic link, created just after the simOutput dir is made. This is part of the .tpl file, e.g. on lines 92-93 of your previously attached gpu.tpl:

cd simOutput
ln -s ../stdout output

I am not sure what went wrong, but maybe if you do this sequence manually, there will be a clearer error?
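
That is, from the run's output directory (<run_dir> being a placeholder):

cd <run_dir>/simOutput
ls -l ../stdout          # does the stdout file exist at all?
ln -s ../stdout output   # recreate the link manually
ls -l output             # should now point to ../stdout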

I don't think we have time-averaged fields as of now. I think it could be done relatively straightforwardly as a postprocessing step using the openPMD API (but again, I am not aware that we offer it out of the box). Please feel free to open issues for such feature requests; for this one you could also bump a really old one, #58.

I believe the issue is resolved, closing.
