Hi,
I got contacted by Dan Levy @danlevy100 about help with setting up PIConGPU on the Wexac cluster at Weizmann Institute of Science (wexac-wis).
The cluster has 12 nodes of 8x V100 per node (plus some nodes with 4x V100).
The cluster uses LSF as a batch system but does not seem to use jsrun (maybe just use mpiexec).
He got PIConGPU installed via Spack already.
This is an interactive startup command for FBPIC:
bsub -J sim_fbpic -o out.%J -e err.%J -q gpu-short -gpu "num=1:mode=shared:j_exclusive=no" -R "rusage[mem=16000]" 'python lwfa_script.py'
Could someone please finalize with him the .tpl template for tbg and the picongpu.profile instructions for our manual?
Resources:
gpu-short, gpu-medium and gpu-long (is the memory request per node or for the whole job? or maybe use -M)
cc @PrometheusPi (recently published PIConGPU sims with Dan, maybe you can finalize this?)
cc @hightower8083 (not with Weizmann anymore but might have some hints)
Hi guys and welcome to my first github comment!
Here's the .tpl file Axel helped me to create:
gpu_batch.tpl.txt
Sadly things are not yet working, i.e., I can't get tbg to submit to the user given queue at the moment.
Thanks in advance for your help!
@danlevy100 I would be glad to help you set up the configuration for Wexac. Since I am busy till Tuesday evening, I could start looking into this on Wednesday. Would this be fine with you?
Thank you for taking care of this, @PrometheusPi :+1:
That would be great, @PrometheusPi. Thanks! I'll try to make some progress on my own in the meantime.
@danlevy100 can you please document the current error message about the memory here?
After submitting the LaserWakefield example with
tbg -s bsub -c etc/picongpu/1.cfg -t etc/picongpu/wexac-wis/gpu_batch.tpl ~/picOutput/LaserWakefield -f
I get:
Memory reservation is (MB): 8192
Memory Limit is (MB): 8192
femalka: No such queue. Job not submitted.
"femalka" is Victor's username in fact... I have no idea why it appears here.
In order to figure it out, it would be helpful to see the resulting submission command after tbg has applied your .tpl file. For the provided tbg command line there should be a file ~/picOutput/LaserWakefield/tbg/submit.start. If it is there, could you attach it? It should contain, among other things, the plain bsub command, so we can compare it to what @ax3l wrote for FBPIC.
Sure, here it is:
submit.start.txt
Thank you @danlevy100 .
So far I see an issue in the gpu_batch.tpl file attached earlier in this topic. On line 30 there is a spurious space in #B SUB (should be #BSUB). I believe it causes that and the following #BSUB lines to have no effect, leading to an improper set of parameters. I don't know if it is the only issue, and I do not have access to a similar machine to check.
In case it does not fix the problem, I think the relevant information may not be just in submit.start, as otherwise it looks fine to me. According to the documentation linked by Axel, it is probably worth looking into the output of bjobs -l JOBID to see whether the queue and other parameters are being set correctly.
Thanks @sbastrakov for having a look. I saw this but thought that maybe it was just to comment out the line. I have removed it anyway and submitted and still no go.
As for bjobs -l, since the job is not being submitted there is no information displayed about it.
Since the error message was about memory, could you please decrease the requested memory from #BSUB -M 8192
to half of it? Why do the FBPIC runs use so much more memory (16000), or is this defined differently?
EDIT:
This definition is in kB - thus 8192 kB is definitely very low - please adjust the memory needed accordingly and use the same as with fbpic:
-R "rusage[mem=16000]"
Furthermore, you seem to not define a project via #BSUB -P. I am not sure how this is handled; since your FBPIC run does not define a project either, I assume you have a default one or none is used. Please try to remove that line; perhaps setting an empty project creates an error while setting none just uses the default.
If this does not work, we could schedule a video meeting to try things out live.
Yep, we tried those already. I guess you will be most efficient with a VC :)
Something that should be mentioned: the way things are set up is that I have installed picongpu at the node level ("interactive session" like getNode on hemera). Submitting a job is thus only possible at the node level. Perhaps this was a mistake, but I could not get things to work otherwise.
When submitting a job, it appears that the memory is limited by the memory requested for the interactive session. Strange, but I think that it is the case.
Also, the error as far as I understand it is not a memory error but a "femalka: No such queue" error.
VC would be great. I'm available throughout most of the day tomorrow and on Friday if that works for you.
@danlevy100 Okay then let's do a VC tomorrow. @sbastrakov Do you want to join as well?
I can
Does 14:00 Dresden time work for you?
@danlevy100 That would be fine with me. How about you @sbastrakov ?
In order to better work together on the submit file (rather than just suggesting changes to a file we see via screen sharing), I would recommend the Atom editor together with the Teletype package, so that we can all type together. Would that be fine with you two?
@danlevy100
Is the following submit script queued/executed by LSF?
#!/usr/bin/env bash
#BSUB -J test
#BSUB -o test.out
#BSUB -e test.err
#BSUB -q gpu-short
#BSUB -gpu "num=1:mode=shared:j_exclusive=no"
#BSUB -R "rusage[mem=16000]"
hostname
nvidia-smi
and then just submitted via bsub without extra arguments?
That is fine with me as well
bsub in fact fails with the same error.
I could also try to get a cluster admin to join our meeting, do you think this could prove useful?
UPDATE:
I got this script to work. The secret is to execute bsub < test_script.sh and not bsub test_script.sh.
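For reference, the two submission styles can be sketched like this (the queue and resource strings are the ones from this thread; the bsub calls themselves are left commented since they only make sense on the cluster):

```shell
# Minimal LSF job script with embedded #BSUB directives.
cat > test_script.sh <<'EOF'
#!/usr/bin/env bash
#BSUB -J test
#BSUB -q gpu-short
#BSUB -R "rusage[mem=16000]"
hostname
EOF

# bsub test_script.sh     # runs the script, but the embedded #BSUB lines are
#                         # ignored unless LSB_BSUB_PARSE_SCRIPT=Y is configured
# bsub < test_script.sh   # bsub reads the script on stdin and parses the
#                         # #BSUB directives itself
```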
Yes, if a cluster admin could join the meeting, that would be great :+1:
@danlevy100 Yes, you are right, this seems to be different in bsub - see https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_admin/job_scripts_writing.html
@danlevy100 According to above reference, in lsf.conf the LSB_BSUB_PARSE_SCRIPT parameter should be set to Y. Could you please check whether this variable can be overwritten on a shell level:
echo $LSB_BSUB_PARSE_SCRIPT
export LSB_BSUB_PARSE_SCRIPT="Y"
bsub test_file_from_above.sh
Setting the variable does work (i.e., echoing it after export gives Y) but bsub still doesn't work without the <.
This explains why your job is not submitted with tbg: tbg runs bsub tbg/submit.start, which cannot work if bsub < tbg/submit.start is needed.
@danlevy100 Could you please try:
bsub -Zs test_file_from_above.sh
I guess that is an explanation. To work around it, one can add this < by manually modifying this piece of tbg to be $submit_command < tbg/submit.start. Or we can do it together in a VC
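An alternative to patching tbg itself could be a small wrapper script that tbg calls instead of bsub. This is only a sketch: the name bsub_stdin is my invention, and it assumes tbg appends the submit file as the last argument of the submit command:

```shell
# Hypothetical wrapper: forward all arguments to bsub, but feed the submit
# file (assumed to be the last argument) via stdin so #BSUB lines are parsed.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/bsub_stdin" <<'EOF'
#!/usr/bin/env bash
args=("$@")
file="${args[-1]}"   # tbg passes the submit file as the last argument
unset 'args[-1]'
exec bsub "${args[@]}" < "$file"
EOF
chmod +x "$HOME/bin/bsub_stdin"
export PATH="$HOME/bin:$PATH"
# then: tbg -s bsub_stdin -c etc/picongpu/1.cfg -t <template>.tpl <outputDir>
```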
Still no go with bsub -Zs.
Wait, but also tbg/submit.start is a shell script already. So perhaps just the first line of that has to be changed? Like to #!/bin/bash as given in the quick guide
I have actually tried bsub < tbg/submit.start and got a new error:
When LSB_CSM_JOBS is not set to Y and -csm is not set then other csm options are not allowed. Job not submitted.
Submitting again after setting export LSB_CSM_JOBS="Y" gives:
You cannot specify -R/-M/-n/LSB_DEFAULT_RESREQ when CSM Easy Mode job submission is enabled. Job not submitted.
This definitely looks as if the cluster does not allow scripted job files: if LSB_CSM_JOBS != Y, other options that were used in the *.cfg are disabled.
This is something a cluster admin has to answer how they would like to handle job scripts on their cluster.
The admin will be there. Hopefully we can solve it together with him.
https://weizmann.zoom.us/j/99360687398?pwd=dkhTeDJQaHltYWlnelM5cnNaR2o4UT09
We could get PIConGPU to run, but it only worked if all tasks were on the same node.
mpiexec -n seems to only schedule tasks on the MPI-rank=0 node.
However, LSF does schedule multiple nodes, as can be seen by checking the variable $LSB_HOSTS while running.
Thus there seems to be some misconfiguration in how mpiexec finds the available machines (it seems to use only the first entry in that list).
To get to multiple nodes, we manually defined a machinefile and used it via mpiexec --machinefile as follows:
echo $LSB_HOSTS | sed -e 's/ /\n/g' > machinefile.txt
mpiexec -n 16 --machinefile machinefile.txt hostname
This apparently told mpiexec to use the nodes scheduled by LSF, but when mpiexec tried to connect to these nodes via ssh, it failed with an authentication error.
Is it possible to ask the admin how to start MPI jobs on multiple nodes? I would say MPI is not compiled with support for the batch system, and is therefore not using the information stored in $LSB_HOSTS.
Yes, that's the plan. We were told that not many people run multi-node jobs there and that it may require a certain MPI version to work, which shouldn't be a problem once we know which version it is.
Update: There is no password-less ssh into GPU nodes. This is only available on non-GPU nodes or for admins. The next test will be together with an admin to run multi-node GPU jobs in admin-mode.
_Another update:_
We eventually gave up on the Spack approach and went for modules.
Using openmpi/2.0.1 we finally got MPI to work today. We successfully ran a "bare-bones" version of PIConGPU. Now we need to install the remaining modules, which I will do with the help of the cluster admin next week.
Here is the simple.profile file that we used:
module load gcc/6.3.0
module load cmake/3.18.4
module load openmpi/2.0.1
module load cuda/9.2
module load boost/1.69.0
export CXX=$(which g++)
export CC=$(which gcc)
export PICSRC=$HOME/src/picongpu
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PIC_BACKEND="cuda:70"
export PATH=$PATH:$PICSRC
export PATH=$PATH:$PICSRC/bin
export PATH=$PATH:$PICSRC/src/tools/bin
@danlevy100 As promised on Monday, you can find a setup script here:
https://gist.github.com/PrometheusPi/3b873c754fbb0f0a2684480d0969410f
Please be aware of the comments that state which lines should be copied to your picongpu.profile as well.
I have not yet tested that script. Thus there still might be some bugs included. If any install fails, please let me know.
After you installed all dependencies, you should be able to run PIConGPU as on hemera. If that is the case, I would be very happy if you could share a submit.start file here, so that we can develop a general *.tpl file based for the Wexac cluster.
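For orientation, the profile additions for self-built dependencies will presumably look something like the lines below. The $LIB prefix follows the gist's convention, but the exact directory names are assumptions; the authoritative lines are the ones marked in the gist itself:

```shell
# Assumed layout: self-built dependencies under one $LIB prefix (as in the gist).
export LIB=${LIB:-$HOME/lib}

# Let CMake find the self-built libraries, and the loader find their .so files.
export CMAKE_PREFIX_PATH="$LIB/openPMD-api:$LIB/adios:$LIB/zlib:$CMAKE_PREFIX_PATH"
export LD_LIBRARY_PATH="$LIB/openPMD-api/lib:$LIB/adios/lib:$LIB/zlib/lib:$LD_LIBRARY_PATH"
```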
@PrometheusPi That's really wonderful. Thanks!!
I gave it a shot and ran into a couple of issues:
- The curl command for zlib should include the filename as well. Not a problem, installed correctly.
- openPMD failed to install. I tried but could not solve it. The error log is attached.
The linker error is saying you should compile ADIOS with -fPIC enabled. You should use
./configure CFLAGS=-fPIC CXXFLAGS=-fPIC --enable-static --enable-shared --prefix=$LIB/adios --with-mpi=$MPI_ROOT --with-zlib=$LIB/zlib --with-blosc=$LIB/c-blosc
@PrometheusPi Could you please update your gist?
For testing, ADIOS1 is fine, but I would suggest switching to ADIOS2: there is no real support for ADIOS1 anymore, and openPMD-api also works much better with ADIOS2.
@danlevy100 Thanks - yes, I quickly changed my initial wget command to curl but forgot that curl requires an output filename. 😓 I now changed it back to wget.
@psychocoderHPC I fixed the gist. Thanks for taking a look at it. Is the readthedocs documentation correct, or is the order wrong or the CXX flag missing there?
@danlevy100 It might be that you have to rebuild libpng as well. It might have linked to the system zlib, not the one you installed. I fixed the gist on that.
Alright, it seems like everything installed fine. However running the simulation fails with an openPMD error:
../LWF/input/bin/picongpu: error while loading shared libraries: libopenPMD.so: cannot open shared object file: No such file or directory
I reinstalled everything with the new script, rebuilt the simulation but still no go.
P.S. there's a small typo at the gist: wegt -> wget.
@danlevy100 Sorry for the typo 😓 - I fixed the gist.
I have an idea: could you please check, whether in $LIB/openPMD-api/ there is a lib directory? Or is there only a lib64 directory?
If there is only a lib64 directory, please change the LD_LIBRARY_PATH extension to:
-export LD_LIBRARY_PATH="$LIB/openPMD-api/lib:$LD_LIBRARY_PATH"
+export LD_LIBRARY_PATH="$LIB/openPMD-api/lib64:$LD_LIBRARY_PATH"
@PrometheusPi That solved it!
Seems like everything is installed and set up correctly now. But now there's a new, MPI-related error:
[hgn10.wexac.weizmann.ac.il:45451] 17 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[hgn02.wexac.weizmann.ac.il:26305] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
When running on the dgx nodes (V100's) the error is slightly different:
[ibdgx009.wexac.weizmann.ac.il:80283] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[ibdgx009.wexac.weizmann.ac.il:80283] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
In fact, this error shows up now also when running the bare-bones simulation. I have no idea what changed from the time it worked. The error also occurs when requesting only 1 GPU (that's the error above).
That is MPI related. So could be either something changed or broke down on the cluster side. Or something in your environment has changed either by introducing more dependencies, or accidentally. To check on your side, you could try making a fresh session, load the environment we used last time during the VC and re-compile and re-run that MPI hello world mini application on CPU and GPU partition. If it's also broken now, it probably needs attention of cluster admin. If it still works, I would first suspect your environment.
Well, that was my bad: the simulation actually ran successfully! Just extremely fast, so I didn't even look at stdout...
Turns out this error was there also when we ran the bare-bones simulation. It doesn't seem to hurt, so I don't really care.
The simulation runs fine on the Quadro RTX 6000 nodes. However that's not the case for the V100 nodes which fail due to some memory allocation issue. I have attached the stdout and stderr files for this job. The submit.start file is attached as well.
While we're at it, I have also attached the output of pic-build of the FoilLCT example. It shows some warnings that I never saw on hemera. Maybe that is helpful in some way. EDIT: turns out the cmake warnings did not register in the attached file (pic-build | tee output.file didn't do the trick). I'll look into this later.
FoilLCT.build.txt
stderr.149243.txt
stdout.149243.txt
submit.start.txt
Regarding the crash on V100s, from the attached files I am not sure what went wrong, besides your observation that it's probably something with memory allocation. You could try investigating by enabling more debug output: rebuilding with pic-build -c "-DPMACC_BLOCKING_KERNEL=ON -DCMAKE_BUILD_TYPE=Debug -DPIC_VERBOSE=21" (when using an existing directory, please remove subdirectory .build first) and re-running. It may produce more details about what went wrong.
Regarding the FoilLCT compilation, I do not see anything suspicious in the attached file. Perhaps that part of the output went to stderr, not stdout?
Seems like the V100 issue is inconsistent. I did get it to work a couple of times. I'm still trying to figure out what's going on.
In any case, things are working nicely and quickly on the Quadro nodes! There are just some minor issues left to solve:
Unhandled exception of type 'St13runtime_error' with message '
Using ADIOS1 through PIConGPU's openPMD plugin is not supported.
Please pick either of the following:
* Use the ADIOS plugin.
* Use the openPMD plugin with another backend, such as ADIOS2.
If the openPMD API has been compiled with support for ADIOS2, the openPMD API
will automatically prefer using ADIOS2 over ADIOS1.
Make sure that environment variable OPENPMD_BP_BACKEND is not set to ADIOS1.
', terminating
picbuild.stdout.txt
picbuild.stderr.txt
EDIT: It's --checkpoint.backend openPMD of course, not --checkpoint.restart openPMD
This exception with checkpointing is because we no longer support ADIOS1 as an openPMD output backend in newly made simulations. We only have legacy support for it, so that a user can still restart from an older checkpoint written with ADIOS1. So currently you need openPMD-api compiled with either the ADIOS2 or HDF5 backend, and then use that backend. You need to install one of those (or both), yourself or via the admin, to use the dev version of PIConGPU.
Regarding the png support, in case you installed it yourself, probably the profile has to be extended with lines like these, where line 55 points to your local installation. The png support is of course optional.
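If libpng/pngwriter were installed manually, the profile lines in question would presumably look like the following; the install prefixes under $LIB are assumptions following the gist's layout, not the linked file's actual content:

```shell
# Hypothetical prefixes for a self-installed png stack.
export LIB=${LIB:-$HOME/lib}
export PNGwriter_DIR="$LIB/pngwriter"   # assumed install prefix

# Make the libraries discoverable by CMake and at runtime.
export CMAKE_PREFIX_PATH="$PNGwriter_DIR:$LIB/libpng:$CMAKE_PREFIX_PATH"
export LD_LIBRARY_PATH="$PNGwriter_DIR/lib:$LIB/libpng/lib:$LD_LIBRARY_PATH"
```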
Those #2381-D: dynamic exception specifications are deprecated warnings concern libSplash, which we still use for output of hdf5 files in some plugins. For main output and checkpoints it was already replaced with openPMD-API. So you can ignore those warnings; they do not cause issues and will disappear once we fully drop libSplash. The last couple of warnings come from openPMD-api itself; I will check if they exist in its current version.
Regarding ADIOS1 vs. 2 there is this explanation, with a TL;DR at the end.
Ok, I see.
For installing ADIOS2, I looked at https://adios2.readthedocs.io/en/latest/setting_up/setting_up.html but am unsure which flags I should use when building, i.e., what the exact commands should be, as in @PrometheusPi 's gist file from a few comments up. I don't want to mess up things which are already working...
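In the spirit of the gist, an ADIOS2 build would be cmake-based rather than configure-based. The version number, flags, and the $LIB prefix below are assumptions to be cross-checked against the ADIOS2 docs; to avoid touching the working install, the sketch only writes a build script to inspect first:

```shell
# Sketch of an ADIOS2 install script (not run automatically; review first).
cat > build_adios2.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
cd "$HOME/src"
wget https://github.com/ornladios/ADIOS2/archive/refs/tags/v2.7.1.tar.gz
tar -xzf v2.7.1.tar.gz
mkdir -p ADIOS2-2.7.1/build
cd ADIOS2-2.7.1/build
cmake -DCMAKE_INSTALL_PREFIX="$LIB/adios2" \
      -DADIOS2_USE_MPI=ON \
      -DADIOS2_USE_Blosc=ON \
      -DADIOS2_BUILD_EXAMPLES=OFF \
      ..
make -j 8 install
EOF
chmod +x build_adios2.sh
```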
For the .tpl file, I think I can write it myself. I'll upload it here once it's working.
EDIT: the only thing I don't know how to do is how to tell tbg to submit with bsub < submit.start instead of bsub submit.start.
EDIT 2: Found it... Tried tbg -s "bsub <" ... but that did not work, so I changed ~/src/picongpu/tbg to do it.
Regarding installation of ADIOS2 I found out that we actually forgot to update our instructions page which now only has ADIOS1. Working on fixing it. I don't think it would break anything, as you can always hide it from openPMD API if something does not work.
@danlevy100 Is your last submit.start file from 20 days ago the one working best and still valid?
If yes, we can setup a *.tpl file. If not, would you please share your current version.
Now our (dev version of) readthedocs is updated with links on installing ADIOS2 as an openPMD-API backend.
Here it is.
The latest status is: the V100 and RTX2000 nodes are not working reliably probably due to mpi issues. I don't know how to solve this and in the meantime I'm working with the RTX6000/8000 nodes.
I have also added this text to the file:
#There are 3 relevant GPU queues on WEXAC: gpu-short (32 gpu's/6 hours max), gpu-medium (24 gpu's/12 hours max)
#and gpu-long (16 gpu's/10 days max). There are different GPU nodes: RTX-2000/6000/8000 and V100.
#The nodes can be explicitly selected using the BSUB -m command.
#The RTX-2000 and V100's do not reliably work at the moment apparently due to mpi issues.
There was a small but important mistake in the memory request in that file (24 GB instead of 48 GB).
Here is the corrected file.
submit_short.tpl.txt