Hello,
I'm trying to get PIConGPU to work on the new CINECA cluster, Marconi100. The main architecture specs: IBM Power AC922 nodes, each with 2 POWER9 CPUs and 4 NVIDIA V100 GPUs.
As a starting point, I forked the latest PIConGPU release and I'm now trying to build the LaserWakefield example in my $HOME via the pic-build command.
To set the environment variables, I took inspiration from the picongpu.profile for the D.A.V.I.D.E. cluster, since it was also a CINECA cluster (no longer in production). As a first step, I decided to load only the mandatory software for now (e.g. no hdf5 or adios).
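For reference, the rough structure of such a minimal profile (the module names below are placeholders rather than the actual Marconi100 module names; only the PIC_* variables and PATH lines follow the usual PIConGPU profile conventions):
# load compiler, CUDA, build tools and one MPI flavour (placeholder module names)
module load gnu
module load cuda
module load cmake
module load boost
module load spectrum_mpi   # or openmpi, depending on the profile
# standard PIConGPU environment variables
export PICSRC=$HOME/src/picongpu                      # path to the PIConGPU sources
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples   # example input sets
export PIC_BACKEND="cuda:70"                          # V100 GPUs = compute capability 7.0
export PATH=$PATH:$PICSRC/bin:$PICSRC/src/tools/bin   # pic-create, pic-build, tbg, ...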
I tried compiling in two different ways: with openmpi and with spectrum_mpi (the two available options, of which spectrum_mpi is the recommended one), both using the gnu compiler (gcc 8).
I attach the profiles that I created for the two cases (picongpu.profile.openmpi.txt and picongpu.profile.spectrum.txt). Basically, I source one of the two profiles and then simply follow your instructions with these commands:
pic-create $PIC_EXAMPLES/LaserWakefield $HOME/lwfa
cd lwfa
pic-build
The compilation fails in both cases, at different stages. I attach the stdout and stderr of the pic-build command (which I saved in picongpu.openmpi.out.txt and picongpu.spectrum.out.txt).
picongpu.profile.openmpi.txt
picongpu.profile.spectrum.txt
picongpu.openmpi.out.txt
picongpu.spectrum.out.txt
Am I doing something wrong? Do you have any recommendations or advice? Should I skip these steps and try with TBG?
Thank you
Hi @aeriforme and welcome!
Thank you for your well-documented issue!
At a first, very brief glance, this looks to me like an error caused by the combination of packages used for compiling, and not something you did wrong.
The steps you took are correct, and TBG (Template Batch Generator) would not have helped: it is the tool that hands a compiled parameter set over to the job system, providing the necessary template and run configuration files.
Since I am unfortunately still busy writing my thesis, let me CC this issue to my trusted colleagues @sbastrakov, @PrometheusPi, @psychocoderHPC.
Just a quick question.
The compile process in your second attempt with spectrum_mpi got very far, and even though there are a lot of Boost warnings (which is well-known behavior for the code on SLURM systems), the process stops abruptly, seemingly without a real error.
Did you perhaps compile on the head node of the cluster, and is there a process time limit in place that caught you?
You can check the limits with ulimit -a (look for the cpu time entry).
If that was indeed the case, compile the code again with the same setup on a compute node that you request with a job.
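For example, something along these lines (all values are placeholders to be adapted to Marconi100's batch system and your account):
# request an interactive allocation on one compute node (placeholder partition/account/time)
salloc --nodes=1 --time=01:00:00 --partition=<gpu_partition> --account=<your_account>
# open a shell on the allocated node and rebuild there
srun --pty bash
source ~/picongpu.profile.spectrum   # or whichever profile you are testing
cd $HOME/lwfa
pic-build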
Hello @aeriforme and thanks for reporting!
With the Spectrum MPI build there is actually an error shown, at line 184. So it does not look to be related to exceeding a head-node wall time limit as @n01r suggested. (Although we do get this on some systems, so it may potentially be an issue on your system as well, just not this time.) I am looking into it.
It looks like we have one suspicious piece of code there (returning a stringstream by value). This seems to need a fix anyway, and it's an easy one; I will provide a PR shortly. However, I don't know yet whether it solves the original issue.
I am a little puzzled, though, why the error message occurs when compiling the YeePML field, which should be disabled in the standard LWFA example and so should never be compiled in this case.
Unfortunately, so far I was not able to reproduce the error on Hemera with gcc 8. However, I made PR #3251 to address the issue I mentioned earlier. I am not sure whether this helps your problem, @aeriforme, but you are welcome to try that branch if you have time.
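In case it helps, one way to fetch that PR branch into an existing clone (here $PICSRC stands for your PIConGPU source directory; the local branch name is just a label):
cd $PICSRC
# fetch the head of PR #3251 from the main repository into a local branch and switch to it
git fetch https://github.com/ComputationalRadiationPhysics/picongpu.git pull/3251/head:pr-3251
git checkout pr-3251
# then rebuild the example
cd $HOME/lwfa
pic-build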
Thank you all!
@sbastrakov, your PR worked and the compilation went through successfully.
Now let's see if I can run it properly...
Oh, glad to hear it, @aeriforme! This PR will be merged into the main branch soon.
For running, tbg and our documentation should help; please feel free to ask questions and report issues.
Would you kindly provide a PR with the Marconi100 profiles you developed?
I'm now at the stage where I can compile and run the code, but not on the compute nodes yet. Basically, I was able to run the LWFA example from the login nodes by, again, simply following your basic instructions:
tbg -s bash -c etc/picongpu/1.cfg -t etc/picongpu/bash/mpirun.tpl out_01
Actually, I had to remove some plugins because I haven't found a way to install libSplash yet (apparently there are problems with hdf5).
The next step is to write a .tpl file to use in combination with the .profile and start running on the compute nodes.
I would gladly provide a PR with the Marconi100 profiles. Shall I do that if and when I succeed in running on the compute nodes?
@aeriforme yes, I think it is best to provide a profile and the corresponding .tpl file together so that other users can build and run on compute nodes out of the box.
We also sometimes encounter situations where some software is missing, outdated, or not working on a system. This is why a lot of our dependencies are optional: if there is no hdf5, for example, PIConGPU will be compiled without it and the corresponding part of the functionality is simply disabled. After getting a minimal configuration building and running, we normally ask the system admins to install the missing pieces of software, if possible, and so gradually get everything fully enabled.
Hello @aeriforme . Do you need any assistance from us with running PIConGPU?
Hello, sorry it took me a while (and I'm not done yet). I wanted to get to a point where I could share some working code before closing this issue.
I can now run on Marconi100, but with no plugins (i.e. I'm still using only the mandatory dependencies).
Here I attach a picongpu.profile example and a template file (m100-cineca.zip) that I tested with the laser wakefield acceleration example on 1 and 4 GPUs.
Besides the missing non-mandatory dependencies, I'm guessing the .tpl is not optimal.
I suspect this because it took 25 seconds to finish ("full simulation time") on 1 GPU, but 1 minute on 4 GPUs. Or is it some other issue?
Also, I haven't addressed process affinity yet, even though I see that in other .tpl files you use, for example, --cpu_bind=sockets. In this regard, I'm not sure what would be best.
Thank you
Sorry, I did not mean to rush you, @aeriforme.
The difference in time you observed may be due to the cuda_memtest utility. We run it before PIConGPU only if the nodes are used exclusively by your run (this is checked around line 100 of your .tpl file). So it probably ran with 4 GPUs, but not with 1. Our idea is that for larger runs the cuda_memtest overhead is negligible, but of course you are free to disable it if you want.
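Schematically, the guard in the template does something like this (variable names are simplified placeholders, not the exact ones in the generated script), so disabling it just means always taking the plain picongpu call:
# run cuda_memtest only when all GPUs of the node belong to this job (placeholder names)
if [ $gpusPerNodeRequested -eq $gpusPerNodeAvailable ]; then
    mpirun cuda_memtest.sh && mpirun picongpu $programParams
else
    mpirun picongpu $programParams
fi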
Regarding affinity, I assume the system documentation or the admins have a recommended way. In that regard, PIConGPU is quite standard, with 1 GPU per MPI rank.
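Purely as an illustration (not a Marconi100-specific recommendation; the flags and counts below are assumptions to check against the system documentation), socket binding in a SLURM-based template could look like:
# bind each MPI rank to a socket, one rank per GPU (4 GPUs per node assumed);
# "picongpu $programParams" stands in for the full command line from the template
srun --ntasks-per-node=4 --cpu_bind=sockets picongpu $programParams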
Understood!
Thank you very much for your support.
I'll PR my files once they are fully functional.