Hi,
I got contacted by Dan Levy @danlevy100 about help with setting up PIConGPU on the Wexac cluster at Weizmann Institute of Science (wexac-wis).
The cluster has 12 nodes of 8x V100 per node (plus some nodes with 4x V100).
The cluster uses LSF as a batch system but does not seem to use jsrun (maybe just use mpiexec).
He got PIConGPU installed via Spack already.
This is an interactive startup command for FBPIC:
bsub -J sim_fbpic -o out.%J -e err.%J -q gpu-short -gpu "num=1:mode=shared:j_exclusive=no" -R "rusage[mem=16000]" 'python lwfa_script.py'
Could someone please finalize with him the .tpl template for tbg and the picongpu.profile instructions for our manual?
Resources:
gpu-short, gpu-medium and gpu-long (is the memory request per node or for the whole job? or maybe use -M)
cc @PrometheusPi (recently published PIConGPU sims with Dan, maybe you can finalize this?)
cc @hightower8083 (not with Weizmann anymore but might have some hints)
Hi guys and welcome to my first github comment!
Here's the .tpl file Axel helped me to create:
gpu_batch.tpl.txt
Sadly things are not yet working, i.e., I can't get tbg to submit to the user given queue at the moment.
Thanks in advance for your help!
@danlevy100 I would be glad to help you set up the configuration for Wexac. Since I am busy till Tuesday evening, I could start looking into this on Wednesday. Would this be fine with you?
Thank you for taking care of this, @PrometheusPi :+1:
That would be great, @PrometheusPi. Thanks! I'll try to make some progress on my own in the meantime.
@danlevy100 can you please document the current error message about the memory here?
After submitting the LaserWakefield example with
tbg -s bsub -c etc/picongpu/1.cfg -t etc/picongpu/wexac-wis/gpu_batch.tpl ~/picOutput/LaserWakefield -f
I get:
Memory reservation is (MB): 8192
Memory Limit is (MB): 8192
femalka: No such queue. Job not submitted.
"femalka" is Victor's username in fact... I have no idea why it appears here.
In order to figure it out, it would be helpful to see the resulting submission command after tbg has applied your .tpl file. For the provided tbg command line there should be a file ~/picOutput/LaserWakefield/tbg/submit.start. If it is there, could you attach it? It should contain, among other things, the plain bsub command, so we can compare it to what @ax3l wrote for FBPIC.
Sure, here it is:
submit.start.txt
Thank you @danlevy100 .
So far I see an issue in the gpu_batch.tpl file attached earlier in this topic. On line 30 there is a spurious space in #B SUB (should be #BSUB). I believe it causes that and the following #BSUB lines to have no effect, leading to an improper set of parameters. I don't know if it is the only issue, and I do not have access to a similar machine to check.
In case it does not fix the problem, I think the relevant information may not be just in submit.start, as otherwise it looks fine to me. According to the documentation linked by Axel, it is probably worth looking into the output of bjobs -l JOBID to see whether the queue and other parameters are being set correctly.
Thanks @sbastrakov for having a look. I saw this but thought that maybe it was just to comment out the line. I have removed it anyway and submitted and still no go.
As for bjobs -l, since the job is not being submitted there is no information displayed about it.
Since the error message was about memory, could you please decrease the requested memory from #BSUB -M 8192
to half of it? Why do the FBPIC runs use so much more memory (16000), or is this defined differently?
EDIT:
This definition is in kB - thus 8192 kB is definitely very low - please adjust the memory needed accordingly and use the same as with fbpic:
-R "rusage[mem=16000]"
Furthermore, you seem to not define a project via #BSUB -P. I am not sure how this is handled; since your FBPIC run does not define a project either, I assume you have a default one or none is used. Please try to remove that line; perhaps setting an empty project creates an error while setting none just uses the default.
If this does not work, we could schedule a video meeting to try things out live.
Yep, we tried those already. I guess you will be most efficient with a VC :)
Something that should be mentioned: the way things are set up is that I have installed picongpu at the node level ("interactive session" like getNode on hemera). Submitting a job is thus only possible at the node level. Perhaps this was a mistake, but I could not get things to work otherwise.
When submitting a job, it appears that the memory is limited by the memory requested for the interactive session. Strange, but I think that it is the case.
Also, the error as far as I understand it is not a memory error but a "femalka: No such queue" error.
VC would be great. I'm available throughout most of the day tomorrow and on Friday if that works for you.
@danlevy100 Okay then let's do a VC tomorrow. @sbastrakov Do you want to join as well?
I can
Does 14:00 Dresden time work for you?
@danlevy100 That would be fine with me. How about you @sbastrakov ?
In order to better work together on the submit file (rather than just suggesting changes to a file we see via screen sharing), I would recommend the Atom editor together with the Teletype package, so that we can all type together. Would that be fine with you two?
@danlevy100
Is the following submit script queued/executed by LSF?
#!/usr/bin/env bash
#BSUB -J test
#BSUB -o test.out
#BSUB -e test.err
#BSUB -q gpu-short
#BSUB -gpu "num=1:mode=shared:j_exclusive=no"
#BSUB -R "rusage[mem=16000]"
hostname
nvidia-smi
and then just submitted via bsub without extra arguments?
That is fine with me as well
bsub in fact fails with the same error.
I could also try to get a cluster admin to join our meeting, do you think this could prove useful?
UPDATE:
I got this script to work. The secret is to execute bsub < test_script.sh and not bsub test_script.sh.
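For reference, the two submission styles can be sketched like this (the queue and resource strings are the ones from this thread; the bsub calls themselves are left commented since they only make sense on the cluster):

```shell
# Minimal LSF job script with embedded #BSUB directives.
cat > test_script.sh <<'EOF'
#!/usr/bin/env bash
#BSUB -J test
#BSUB -q gpu-short
#BSUB -R "rusage[mem=16000]"
hostname
EOF

# bsub test_script.sh     # runs the script, but the embedded #BSUB lines are
#                         # ignored unless LSB_BSUB_PARSE_SCRIPT=Y is configured
# bsub < test_script.sh   # bsub reads the script on stdin and parses the
#                         # #BSUB directives itself
```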
Yes, if a cluster admin could join the meeting, that would be great :+1:
@danlevy100 Yes, you are right, this seems to be different in bsub - see https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_admin/job_scripts_writing.html
@danlevy100 According to above reference, in lsf.conf the LSB_BSUB_PARSE_SCRIPT parameter should be set to Y. Could you please check whether this variable can be overwritten on a shell level:
echo $LSB_BSUB_PARSE_SCRIPT
export LSB_BSUB_PARSE_SCRIPT="Y"
bsub test_file_from_above.sh
Setting the variable does work (i.e., echoing it after export gives Y) but bsub still doesn't work without the <.
This explains why your job is not submitted with tbg: tbg runs bsub tbg/submit.start, which cannot work if bsub < tbg/submit.start is needed.
@danlevy100 Could you please try:
bsub -Zs test_file_from_above.sh
I guess that is an explanation. To work around it, one can add this < by manually modifying this piece of tbg to be $submit_command < tbg/submit.start. Or we can do it together in a VC
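An alternative to patching tbg itself could be a small wrapper script that tbg calls instead of bsub. This is only a sketch: the name bsub_stdin is my invention, and it assumes tbg appends the submit file as the last argument of the submit command:

```shell
# Hypothetical wrapper: forward all arguments to bsub, but feed the submit
# file (assumed to be the last argument) via stdin so #BSUB lines are parsed.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/bsub_stdin" <<'EOF'
#!/usr/bin/env bash
args=("$@")
file="${args[-1]}"   # tbg passes the submit file as the last argument
unset 'args[-1]'
exec bsub "${args[@]}" < "$file"
EOF
chmod +x "$HOME/bin/bsub_stdin"
export PATH="$HOME/bin:$PATH"
# then: tbg -s bsub_stdin -c etc/picongpu/1.cfg -t <template>.tpl <outputDir>
```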
Still no go with bsub -Zs.
Wait, but also tbg/submit.start is a shell script already. So perhaps just the first line of that has to be changed? Like to #!/bin/bash as given in the quick guide
I have actually tried bsub < tbg/submit.start and got a new error:
When LSB_CSM_JOBS is not set to Y and -csm is not set then other csm options are not allowed. Job not submitted.
Submitting again after setting export LSB_CSM_JOBS="Y" gives:
You cannot specify -R/-M/-n/LSB_DEFAULT_RESREQ when CSM Easy Mode job submission is enabled. Job not submitted.
This definitely looks as if the cluster does not allow scripted job files: if LSB_CSM_JOBS != Y, other options that were used in the *.cfg are disabled.
This is something a cluster admin has to answer how they would like to handle job scripts on their cluster.
The admin will be there. Hopefully we can solve it together with him.
https://weizmann.zoom.us/j/99360687398?pwd=dkhTeDJQaHltYWlnelM5cnNaR2o4UT09
We could get PIConGPU to run, but it only worked if all tasks were on the same node.
mpiexec -n seems to only schedule tasks on the MPI-rank=0 node.
However, LSF does schedule multiple nodes, as can be seen by checking the variable $LSB_HOSTS while running.
Thus there seems to be some misconfiguration in how mpiexec finds the available machines (it seems to use only the first entry in that list).
To get to multiple nodes, we manually defined a machinefile and used it via mpiexec --machinefile as follows:
echo $LSB_HOSTS | sed -e 's/ /\n/g' > machinefile.txt
mpiexec -n 16 --machinefile machinefile.txt hostname
This apparently told mpiexec to use the nodes scheduled by LSF, but when mpiexec tried to connect to these nodes via ssh, it failed with an authentication error.
Is it possible to ask the admin how to start MPI jobs on multiple nodes? I would say MPI is not compiled with support for the batch system, and is therefore not using the information stored in $LSB_HOSTS.
Yes, that's the plan. We were told that not many people run multi-node jobs there and that it may require a certain MPI version to work, which shouldn't be a problem once we know which version it is.
Update: There is no password-less ssh into GPU nodes. This is only available on non-GPU nodes or for admins. The next test will be together with an admin to run multi-node GPU jobs in admin-mode.
_Another update:_
We eventually gave up on the Spack approach and went for modules.
Using openmpi/2.0.1 we finally got MPI to work today. We successfully ran a "bare-bones" version of PIConGPU. Now we need to install the remaining modules, which I will do with the help of the cluster admin next week.
Here is the simple.profile file that we used:
module load gcc/6.3.0
module load cmake/3.18.4
module load openmpi/2.0.1
module load cuda/9.2
module load boost/1.69.0
export CXX=$(which g++)
export CC=$(which gcc)
export PICSRC=$HOME/src/picongpu
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PIC_BACKEND="cuda:70"
export PATH=$PATH:$PICSRC
export PATH=$PATH:$PICSRC/bin
export PATH=$PATH:$PICSRC/src/tools/bin
@danlevy100 As promised on Monday, you can find a setup script here:
https://gist.github.com/PrometheusPi/3b873c754fbb0f0a2684480d0969410f
Please be aware of the comments that state which lines should be copied to your picongpu.profile as well.
I have not yet tested that script. Thus there still might be some bugs included. If any install fails, please let me know.
After you installed all dependencies, you should be able to run PIConGPU as on hemera. If that is the case, I would be very happy if you could share a submit.start file here, so that we can develop a general *.tpl file based for the Wexac cluster.
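For orientation, the profile additions for self-built dependencies will presumably look something like the lines below. The $LIB prefix follows the gist's convention, but the exact directory names are assumptions; the authoritative lines are the ones marked in the gist itself:

```shell
# Assumed layout: self-built dependencies under one $LIB prefix (as in the gist).
export LIB=${LIB:-$HOME/lib}

# Let CMake find the self-built libraries, and the loader find their .so files.
export CMAKE_PREFIX_PATH="$LIB/openPMD-api:$LIB/adios:$LIB/zlib:$CMAKE_PREFIX_PATH"
export LD_LIBRARY_PATH="$LIB/openPMD-api/lib:$LIB/adios/lib:$LIB/zlib/lib:$LD_LIBRARY_PATH"
```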
@PrometheusPi That's really wonderful. Thanks!!
I gave it a shot and ran into a couple of issues:
- The curl command for zlib should include the filename as well. Not a problem, installed correctly.
- openPMD failed to install. I tried but could not solve it. The error log is attached.
The linker error is saying you should compile ADIOS with -fPIC enabled. You should use
./configure CFLAGS=-fPIC CXXFLAGS=-fPIC --enable-static --enable-shared --prefix=$LIB/adios --with-mpi=$MPI_ROOT --with-zlib=$LIB/zlib --with-blosc=$LIB/c-blosc
@PrometheusPi Could you please update your gist?
For testing, ADIOS1 is fine, but I would suggest switching to ADIOS2: there is no real support for ADIOS1 anymore, and openPMD-api also works much better with ADIOS2.
@danlevy100 Thanks - yes, I quickly changed my initial wget command to curl but forgot that curl requires an output filename. 😓 I now changed it back to wget.
@psychocoderHPC I fixed the gist. Thanks for taking a look at it. Is the readthedocs documentation correct, or is the order wrong or the CXX flag missing there?
@danlevy100 It might be that you have to rebuild libpng as well. It might have linked to the system zlib, not the one you installed. I fixed the gist on that.
Alright, it seems like everything installed fine. However running the simulation fails with an openPMD error:
../LWF/input/bin/picongpu: error while loading shared libraries: libopenPMD.so: cannot open shared object file: No such file or directory
I reinstalled everything with the new script, rebuilt the simulation but still no go.
P.S. there's a small typo at the gist: wegt -> wget.
@danlevy100 Sorry for the typo 😓 - I fixed the gist.
I have an idea: could you please check, whether in $LIB/openPMD-api/ there is a lib directory? Or is there only a lib64 directory?
If there is only a lib64 directory, please change the LD_LIBRARY_PATH extension to:
-export LD_LIBRARY_PATH="$LIB/openPMD-api/lib:$LD_LIBRARY_PATH"
+export LD_LIBRARY_PATH="$LIB/openPMD-api/lib64:$LD_LIBRARY_PATH"
@PrometheusPi That solved it!
Seems like everything is installed and set up correctly now. But now there's a new, MPI-related error:
[hgn10.wexac.weizmann.ac.il:45451] 17 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[hgn02.wexac.weizmann.ac.il:26305] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
When running on the dgx nodes (V100's) the error is slightly different:
[ibdgx009.wexac.weizmann.ac.il:80283] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[ibdgx009.wexac.weizmann.ac.il:80283] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
In fact, this error shows up now also when running the bare-bones simulation. I have no idea what changed from the time it worked. The error also occurs when requesting only 1 GPU (that's the error above).
That is MPI related. So could be either something changed or broke down on the cluster side. Or something in your environment has changed either by introducing more dependencies, or accidentally. To check on your side, you could try making a fresh session, load the environment we used last time during the VC and re-compile and re-run that MPI hello world mini application on CPU and GPU partition. If it's also broken now, it probably needs attention of cluster admin. If it still works, I would first suspect your environment.
Well, that was my bad: the simulation actually ran successfully! Just extremely fast, so I didn't even look at stdout...
Turns out this error was there also when we ran the bare-bones simulation. It doesn't seem to hurt, so I don't really care.
The simulation runs fine on the Quadro RTX 6000 nodes. However that's not the case for the V100 nodes which fail due to some memory allocation issue. I have attached the stdout and stderr files for this job. The submit.start file is attached as well.
While we're at it, I have also attached the output of pic-build of the FoilLCT example. It shows some warnings that I never saw on hemera. Maybe that is helpful in some way. EDIT: turns out the cmake warnings did not register in the attached file (pic-build | tee output.file didn't do the trick). I'll look into this later.
FoilLCT.build.txt
stderr.149243.txt
stdout.149243.txt
submit.start.txt
Regarding the crash on V100s, from the attached files I am not sure what went wrong, besides your observation that it's probably something with memory allocation. You could try investigating by enabling more debug output: rebuilding with pic-build -c "-DPMACC_BLOCKING_KERNEL=ON -DCMAKE_BUILD_TYPE=Debug -DPIC_VERBOSE=21" (when using an existing directory, please remove subdirectory .build first) and re-running. It may produce more details about what went wrong.
Regarding the FoilLCT compilation, I do not see anything suspicious in the attached file. Perhaps that part of the output went to stderr, not stdout?
Seems like the V100 issue is inconsistent. I did get it to work a couple of times. I'm still trying to figure out what's going on.
In any case, things are working nicely and quickly on the Quadro nodes! There are just some minor issues left to solve:
Unhandled exception of type 'St13runtime_error' with message '
Using ADIOS1 through PIConGPU's openPMD plugin is not supported.
Please pick either of the following:
* Use the ADIOS plugin.
* Use the openPMD plugin with another backend, such as ADIOS2.
If the openPMD API has been compiled with support for ADIOS2, the openPMD API
will automatically prefer using ADIOS2 over ADIOS1.
Make sure that environment variable OPENPMD_BP_BACKEND is not set to ADIOS1.
', terminating
picbuild.stdout.txt
picbuild.stderr.txt
EDIT: It's --checkpoint.backend openPMD of course, not --checkpoint.restart openPMD
This exception with checkpointing is because we no longer support ADIOS1 as an openPMD output backend in newly made simulations. We only have legacy support for it, so that a user can still restart from an older checkpoint written with ADIOS1. So currently you need openPMD-api compiled with either the ADIOS2 or HDF5 backend, and then use that backend. You need to install one of those (or both), yourself or via the admin, to use the dev version of PIConGPU.
Regarding the png support, in case you installed it yourself, probably the profile has to be extended with lines like these, where line 55 points to your local installation. The png support is of course optional.
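If libpng/pngwriter were installed manually, the profile lines in question would presumably look like the following; the install prefixes under $LIB are assumptions following the gist's layout, not the linked file's actual content:

```shell
# Hypothetical prefixes for a self-installed png stack.
export LIB=${LIB:-$HOME/lib}
export PNGwriter_DIR="$LIB/pngwriter"   # assumed install prefix

# Make the libraries discoverable by CMake and at runtime.
export CMAKE_PREFIX_PATH="$PNGwriter_DIR:$LIB/libpng:$CMAKE_PREFIX_PATH"
export LD_LIBRARY_PATH="$PNGwriter_DIR/lib:$LIB/libpng/lib:$LD_LIBRARY_PATH"
```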
Those #2381-D: dynamic exception specifications are deprecated warnings concern libSplash, which we still use for output of hdf5 files in some plugins. For main output and checkpoints it was already replaced with openPMD-API. So you can ignore those warnings; they do not cause issues and will disappear once we fully drop libSplash. The last couple of warnings come from openPMD-api itself; I will check if they exist in its current version.
Regarding ADIOS1 vs. 2 there is this explanation, with a TL;DR at the end.
Ok, I see.
For installing ADIOS2, I looked at https://adios2.readthedocs.io/en/latest/setting_up/setting_up.html but am unsure which flags I should use when building, i.e., what the exact commands should be, as in @PrometheusPi 's gist file from a few comments up. I don't want to mess up things which are already working...
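In the spirit of the gist, an ADIOS2 build would be cmake-based rather than configure-based. The version number, flags, and the $LIB prefix below are assumptions to be cross-checked against the ADIOS2 docs; to avoid touching the working install, the sketch only writes a build script to inspect first:

```shell
# Sketch of an ADIOS2 install script (not run automatically; review first).
cat > build_adios2.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
cd "$HOME/src"
wget https://github.com/ornladios/ADIOS2/archive/refs/tags/v2.7.1.tar.gz
tar -xzf v2.7.1.tar.gz
mkdir -p ADIOS2-2.7.1/build
cd ADIOS2-2.7.1/build
cmake -DCMAKE_INSTALL_PREFIX="$LIB/adios2" \
      -DADIOS2_USE_MPI=ON \
      -DADIOS2_USE_Blosc=ON \
      -DADIOS2_BUILD_EXAMPLES=OFF \
      ..
make -j 8 install
EOF
chmod +x build_adios2.sh
```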
For the .tpl file, I think I can write it myself. I'll upload it here once it's working.
EDIT: the only thing I don't know how to do is how to tell tbg to submit with bsub < submit.start instead of bsub submit.start.
EDIT 2: Found it... Tried tbg -s "bsub <" ... but that did not work, so I changed ~/src/picongpu/tbg to do it.
Regarding installation of ADIOS2 I found out that we actually forgot to update our instructions page which now only has ADIOS1. Working on fixing it. I don't think it would break anything, as you can always hide it from openPMD API if something does not work.
@danlevy100 Is your last submit.start file from 20 days ago the one working best and still valid?
If yes, we can setup a *.tpl file. If not, would you please share your current version.
Now our (dev version of) readthedocs is updated with links on installing ADIOS2 as an openPMD-API backend.
Here it is.
The latest status is: the V100 and RTX2000 nodes are not working reliably probably due to mpi issues. I don't know how to solve this and in the meantime I'm working with the RTX6000/8000 nodes.
I have also added this text to the file:
#There are 3 relevant GPU queues on WEXAC: gpu-short (32 gpu's/6 hours max), gpu-medium (24 gpu's/12 hours max)
#and gpu-long (16 gpu's/10 days max). There are different GPU nodes: RTX-2000/6000/8000 and V100.
#The nodes can be explicitly selected using the BSUB -m command.
#The RTX-2000 and V100's do not reliably work at the moment apparently due to mpi issues.
There was a small but important mistake in the memory request in that file (24 GB instead of 48 GB).
Here is the corrected file.
submit_short.tpl.txt