Hi everyone, I am working with @TheresaBruemmer to start using PIConGPU for ICS simulations (so this issue is related to https://github.com/ComputationalRadiationPhysics/picongpu/issues/3346). Before trying to install on Maxwell, I am trying on Juwels as PIConGPU has already been installed there.
Following the instructions, from my vanilla configuration on Juwels, I did:
This last step, installing PNGwriter, failed at step make install with error
$ make install
Scanning dependencies of target PNGwriter
[ 8%] Building CXX object CMakeFiles/PNGwriter.dir/src/pngwriter.cc.o
[ 16%] Linking CXX static library libPNGwriter.a
[ 16%] Built target PNGwriter
Scanning dependencies of target lyapunov
[ 25%] Building CXX object CMakeFiles/lyapunov.dir/examples/lyapunov.cc.o
[ 33%] Linking CXX executable lyapunov
libPNGwriter.a(pngwriter.cc.o):pngwriter.cc:function pngwriter::close(): error: undefined reference to 'png_convert_to_rfc1123_buffer'
collect2: error: ld returned 1 exit status
make[2]: *** [lyapunov] Error 1
make[1]: *** [CMakeFiles/lyapunov.dir/all] Error 2
make: *** [all] Error 2
Do you know where this comes from? Should I open an issue on the pngwriter repo instead? Note that the CMake step worked well, see CMake_output.txt for the CMake output.
Hello @MaxThevenet, thanks for your report. Off the top of my head, I do not know the reason for this error. However, while we are investigating it, you could skip this optional dependency: just remove $PNGWRITER_ROOT: from this line and continue with building PIConGPU, temporarily without pngwriter.
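As a rough sketch of that workaround: instead of editing the profile, the entry can also be dropped from the variable in the current shell. This assumes the line in question is the one that prepends $PNGWRITER_ROOT to CMAKE_PREFIX_PATH, and the paths below are made-up placeholders, not the real Juwels paths.

```shell
# Hypothetical example values; in practice the profile sets these.
PNGWRITER_ROOT="/tmp/example/lib/pngwriter"
CMAKE_PREFIX_PATH="/tmp/example/lib/a:$PNGWRITER_ROOT:/tmp/example/lib/b"

# Split on ':', drop the exact PNGwriter entry, and join the rest again.
CMAKE_PREFIX_PATH=$(printf '%s\n' "$CMAKE_PREFIX_PATH" \
  | tr ':' '\n' \
  | grep -vxF -- "$PNGWRITER_ROOT" \
  | paste -sd: -)
export CMAKE_PREFIX_PATH

echo "$CMAKE_PREFIX_PATH"
```

With the entry gone, CMake has no hint for the PNGwriter location, so the optional dependency should simply be treated as absent.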
Thanks @sbastrakov for your suggestion, I followed instructions at https://picongpu.readthedocs.io/en/0.5.0/usage/basics.html and, after a bunch of warnings (but no error), I received messages
Scanning dependencies of target picongpu
[100%] Linking CXX executable picongpu
[100%] Built target picongpu
so it seems that the build was successful \o/. Is TBG the usual way to go to run a simulation on Juwels? If so, I'll go through the doc and submit a 1-GPU run.
@MaxThevenet It looks like something is wrong with the installed libpng on Juwels. A workaround is to build your own libpng. For that you also need to build your own zlib.
#load the picongpu profile before you execute this!!!!
# create install directory
mkdir -p $PARTITION_LIB
#install zlib
export ZLIB_ROOT=$PARTITION_LIB/zlib
export CMAKE_PREFIX_PATH=$ZLIB_ROOT:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$ZLIB_ROOT/lib:$LD_LIBRARY_PATH
wget -O zlib-1.2.11.tar.gz https://github.com/madler/zlib/archive/v1.2.11.tar.gz
tar -xf zlib-1.2.11.tar.gz
cd zlib-1.2.11
./configure --prefix=$ZLIB_ROOT
make -j4
make install
cd ..
#install libpng
export PNG_ROOT=$PARTITION_LIB/png
export CMAKE_PREFIX_PATH=$PNG_ROOT:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$PNG_ROOT/lib:$LD_LIBRARY_PATH
wget -O libpng-1.6.34.tar.gz \
ftp://ftp-osl.osuosl.org/pub/libpng/src/libpng16/libpng-1.6.34.tar.gz
tar -xf libpng-1.6.34.tar.gz
cd libpng-1.6.34
CPPFLAGS=-I$ZLIB_ROOT/include LDFLAGS=-L$ZLIB_ROOT/lib \
./configure --enable-static --enable-shared \
--prefix=$PNG_ROOT
make -j4
make install
cd ..
# build pngwriter
git clone https://github.com/pngwriter/pngwriter.git pngwriter/
cd pngwriter
cmake -DCMAKE_INSTALL_PREFIX=$PNGWRITER_ROOT .
make install
Do not forget to add this to your PIConGPU profile:
export ZLIB_ROOT=$PARTITION_LIB/zlib
export CMAKE_PREFIX_PATH=$ZLIB_ROOT:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$ZLIB_ROOT/lib:$LD_LIBRARY_PATH
export PNG_ROOT=$PARTITION_LIB/png
export CMAKE_PREFIX_PATH=$PNG_ROOT:$CMAKE_PREFIX_PATH
export LD_LIBRARY_PATH=$PNG_ROOT/lib:$LD_LIBRARY_PATH
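To check whether a given libpng actually exports the symbol the linker complained about, something like the following could help. It is only a sketch: the library file name libpng16.so and its location under $PNG_ROOT/lib are assumptions about how the build above installs, and has_symbol is a helper made up for this example.

```shell
# Report whether a shared library exports a given symbol.
# Returns non-zero if the file is missing, nm is unavailable, or the symbol is absent.
has_symbol() {
  lib=$1
  sym=$2
  [ -f "$lib" ] || return 1
  command -v nm >/dev/null 2>&1 || return 1
  nm -D --defined-only "$lib" 2>/dev/null | grep -qw -- "$sym"
}

# Assumed location of the freshly built library; adjust to your install.
if has_symbol "${PNG_ROOT:-/nonexistent}/lib/libpng16.so" png_convert_to_rfc1123_buffer; then
  echo "libpng provides png_convert_to_rfc1123_buffer"
else
  echo "symbol not found (or library missing)"
fi
```

Running the same check against the system libpng may show whether it is the source of the undefined reference.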
Hello @MaxThevenet . Regarding launching, yes, tbg is a wrapper for batch submission systems (such as slurm) so that there are unified PIConGPU workflows on all machines. The JUWELS profile you linked above should be ready to work with tbg out-of-the-box.
Thanks a lot, I will try this. In the meantime, should I be able to run a test simulation? I tried
$ pic-create $PIC_EXAMPLES/LaserWakefield ./myLWFA
$ cd myLWFA
$ pic-build # compiled successfully
$ tbg -s sbatch -c etc/picongpu/1.cfg -t etc/picongpu/juwels-jsc/gpus.tpl $SCRATCH/picongpu/test-001
and the tbg command takes several minutes, is this expected?
Then, I get two errors:
- sbatch: unrecognized option '--workdir=***/picongpu/test-002', which I can fix by manually deleting the line #SBATCH --workdir=***/test-001 from submit.start.
- In submit.start, the project id is spelled chh*** instead of hh*** at line #SBATCH --account=hh***. I manually updated this line.
Then, manually submitting the job with sbatch submit.start works well. The job is still in the queue, I'll let you know if it runs successfully.
Hm, the Slurm option --workdir is never mentioned in the Juwels doc, I am wondering if it is supported.
@MaxThevenet It looks like the SLURM batch system got an update: --workdir must now be --chdir. I will open a PR.
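Until such a fix is available, the generated script could be patched instead of deleting the line by hand. A minimal sketch, assuming the line has the form #SBATCH --workdir=... as quoted above; fix_workdir is a name invented here, and the demo file and its path are throwaway examples.

```shell
# Rewrite the --workdir option to --chdir in a generated batch script.
fix_workdir() {
  script=$1
  tmp="$script.tmp"
  sed 's/^#SBATCH --workdir=/#SBATCH --chdir=/' "$script" > "$tmp" && mv "$tmp" "$script"
}

# Demo on a throwaway file with a made-up path.
demo=$(mktemp)
printf '#SBATCH --partition=gpus\n#SBATCH --workdir=/tmp/example/test-001\n' > "$demo"
fix_workdir "$demo"
cat "$demo"
```

On a real run, the argument would be the submit.start file that tbg wrote to the output directory.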
Ah great, thanks.
So when submitting sbatch submit.start the simulation runs properly provided I delete all png... options from submit.start (otherwise I get error unrecognised option '--e_png.period', I guess it's because I didn't compile PNGwriter). However, I get a couple of warnings, not sure if they are due to Juwels issues or something in PIConGPU:
[thevenet1@juwels00 test-001]$ cat stderr
/var/spool/parastation/jobs/2642537: line 1: /opt/jsc/sbin/jutil/jutil.exe: No such file or directory
no binary 'cuda_memtest' available or compute node is not exclusively allocated, skip GPU memory test
Note that
[thevenet1@juwels00 test-001]$ ls -l /opt/jsc/sbin/jutil/jutil.exe
returns
-rwsr-xr-x 1 root root 7898088 Aug 6 16:40 /opt/jsc/sbin/jutil/jutil.exe
@psychocoderHPC I tried your script, adding export PNGWRITER_ROOT=$PARTITION_LIB/pngwriter somewhere in there, and pngWriter installed well. Then PIConGPU compiles well with PNGWriter and runs fine, creating png images \o/.
One caveat though, I had multiple errors like
unrecognised option '--e_phaseSpace.space'
etc. with all --e_phaseSpace.* options. However, I do get images e_png_yx_0.5_00*.png, and they look great!!
Thanks a lot for your help.
Overall, we encountered a few things:
- --workdir should be replaced by --chdir
- --e_phaseSpace.*** aren't recognized
- /var/spool/parastation/jobs/2642537: line 1: /opt/jsc/sbin/jutil/jutil.exe: No such file or directory
  no binary 'cuda_memtest' available or compute node is not exclusively allocated, skip GPU memory test
Should I open new separate issues for each of these?
Cheers
Regarding the
[thevenet1@juwels00 test-001]$ cat stderr
/var/spool/parastation/jobs/2642537: line 1: /opt/jsc/sbin/jutil/jutil.exe: No such file or directory
no binary 'cuda_memtest' available or compute node is not exclusively allocated, skip GPU memory test
output. I believe it does not concern jutil itself, but this message is generated by our submission script. cuda_memtest is our separate small program to test GPU memory "health" before starting PIConGPU. However, we can only run it when a node is fully allocated by your job. It does not affect PIConGPU itself, just skips the memory test and starts PIConGPU immediately. Btw we already improved the clarity of this message in the dev branch a couple of days ago.
(otherwise I get error unrecognised option '--e_png.period', I guess it's because I didn't compile PNGwriter)
Indeed. Many dependencies are optional, and the corresponding plugins are #ifdef-guarded. Thus, e.g., the PNG plugin simply does not exist when the PNGwriter dependency was not found by CMake, and so its options do not exist either.
Edit: now I see it's outdated as you managed to add the dependency later. However, this logic is general and applies to other plugins as well.
Regarding the phase space plugin, do you have HDF5 dependency (edit: actually libSplash dependency which in turn depends on HDF5)? The plugin so far relies on that, however not for long as discussed in #3357
Thanks a lot for clarifying these things. Indeed, libSplash seems to have been installed successfully. My current workflow is to:
1. Generate the files (pic-create, pic-build, tbg)
2. tbg command crashes, modify the slurm script by hand:
   a. project id
   b. chdir (I can open a PR for it, if you want)
   c. remove all phaseSpace options (? to what I understand, this should work, I don't know how to fix this)
What are the action items to make these easier (essentially, make the tbg command successful by fixing points a, b and c)? Should I open PRs to fix the easy issues, or wait for some PRs to be merged? Do you need more info to understand the libSplash behavior?
Thanks for summarizing. My thoughts are below.
Generate the files (pic-create, pic-build, tbg)
That's right!
project id
In case you know a general solution to express it via environment variables of Juwels, please share it here or via a PR (I believe we tried to do that, but it apparently turned out not to be general). In case the project number needs to be hard-coded, you could try the following. Copy the ..._picongpu.profile.example somewhere to your home directory, modify it there (maybe not just the project number, but also e.g. the email notification settings) and source that file when starting a terminal to work with PIConGPU. It will still connect to the correct .tpl file out-of-the-box, so no manual work should be required there. With this approach, however, you need to follow the changes made to this file in the repository.
chdir (I can open a PR for it, if you want)
That would be very welcome!
remove all phaseSpace options (? to what I understand, this should work, I don't know how to fix this)
I am not sure what the issue is there. I will take a look on our system first. Meanwhile, if you already have a functioning .cfg file after manual changes, you could just copy this file into your new input directory. It is also possible to use pic-create to create one input directory from another one (not necessarily from inside our repository). In this case, be aware that after you "copy" (with pic-create), the new directory may contain the PIConGPU binary built from the old one. To remove it and re-build (if you modify .param files), remove the .build subdirectory and run pic-build again.
Sorry, I perhaps misunderstood one point. To clarify, you can disable the phaseSpace plugin (which does not work for a yet unknown reason) before submitting a job with tbg. To do so, edit the .cfg file that you submit to tbg in a text editor beforehand and remove the related options, e.g. either make this variable empty or just don't use it by removing this line
So the contents of .cfg files are directly mapped to the command line parameters of PIConGPU; however, they are normally (relatively) more human-readable in a .cfg, since they are grouped by variables and have comments.
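To illustrate the idea in plain shell (not the actual .cfg syntax): option groups live in variables and are concatenated into the final argument list, so emptying one variable removes that plugin's options. The variable names and the period value below are hypothetical; only the --e_png.period option name comes from this thread.

```shell
# Hypothetical variable names; the real .cfg uses its own names and tbg's syntax.
OPTS_PNG="--e_png.period 100"
OPTS_PHASESPACE=""   # left empty: the phaseSpace options are never passed

# The groups are concatenated into the final command-line arguments.
PLUGIN_ARGS="$OPTS_PNG $OPTS_PHASESPACE"
echo "picongpu $PLUGIN_ARGS"
```

The same effect is achieved by leaving the variable defined but not referencing it where the plugin options are collected.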
Regarding the phaseSpace. I could not reproduce it on Hemera so far.
@MaxThevenet to help investigate, could you provide the output of the following commands? Please run them from a directory with compiled PIConGPU (after pic-build) that causes the unrecognised option '--e_phaseSpace.space' error later:
.build/picongpu -v
and
.build/picongpu -h
Btw, generally -h should output all plugins that are available for the current PIConGPU build, i.e. for the dependencies used and the species defined.
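Based on that, a quick pre-flight check before submitting could look roughly like this. It is a sketch only: has_option is a helper invented here, and the help text in the demo is a made-up excerpt rather than real picongpu output.

```shell
# Succeeds if the option name occurs in the given help text.
has_option() {
  help_text=$1
  option=$2
  printf '%s\n' "$help_text" | grep -qF -- "$option"
}

# Made-up help excerpt for demonstration; a real check would use
#   has_option "$(.build/picongpu -h)" "--e_phaseSpace.space"
sample_help="  --e_png.period arg    (illustrative line, not real output)"

for opt in --e_png.period --e_phaseSpace.space; do
  if has_option "$sample_help" "$opt"; then
    echo "$opt: available"
  else
    echo "$opt: not in this build"
  fi
done
```

An option reported as missing would point to a plugin that was not compiled in, typically because its dependency was not found.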
Note: The phaseSpace plugin requires libsplash as dependency.
@MaxThevenet did you compile libsplash on your own?
From that message it seemed so. That's why I would like to see the -v and -h outputs.
@Anton-Le Could you please check whether SLURM uses workdir or chdir?
See: https://github.com/ComputationalRadiationPhysics/picongpu/issues/3365#issuecomment-698363124
OK, so pic-build indeed couldn't find Splash (here is the output of .build/picongpu -v and .build/picongpu -h), because I installed it following these instructions, but the default profile contains LIBSPLASH_ROOT=$PARTITION_LIB/libsplash. After fixing this (and the other issues mentioned above), pic-build runs with
-- Found Splash: /p/home/jusers/thevenet1/juwels/lib/splash/lib/cmake/Splash
and the tbg command runs well and submits the run as expected (I will still have to see if it runs well, but the queue on Juwels is a bit long. BTW, is there an option that I can pass to tbg to use #SBATCH --partition=develgpus instead of #SBATCH --partition=gpus?).
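A mismatch like this one (profile pointing to a directory where nothing is installed) might be caught early with a small check after sourcing the profile. This is just a sketch; check_roots is a made-up helper, and the list of variables is taken from the names mentioned in this thread.

```shell
# Print, for each named variable, whether it points to an existing directory.
check_roots() {
  for var in "$@"; do
    eval "dir=\${$var:-}"
    if [ -n "$dir" ] && [ -d "$dir" ]; then
      echo "ok      $var=$dir"
    else
      echo "missing $var=$dir"
    fi
  done
}

# Variable names taken from this thread; extend the list to match your profile.
check_roots LIBSPLASH_ROOT PNGWRITER_ROOT ZLIB_ROOT PNG_ROOT
```

Any line starting with "missing" indicates a *_ROOT variable that is unset or does not match the actual install location.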
@MaxThevenet Yes, there is such an option: override the default -s flag of tbg. Instead of tbg -s -t ... you use tbg -s "sbatch --partition=develgpus" -t ....
As long as the system settings (GPUs per node, memory, etc.) are the same for the partitions develgpus and gpus, this should work.
Wonderful, thanks!
Or alternatively you can change it here. This value is passed to --partition on line 25 of that file.
Normally we make a separate pair of .profile and .tpl files for each partition on a system.
Hi everyone, since the latest Juwels update, some module versions are newer than required by PIConGPU according to https://picongpu.readthedocs.io/en/0.5.0/install/dependencies.html
GCC/9.3.0
CUDA/11.0
Boost/1.74.0
While some module versions are compatible (I did not list all) and others may be installed from source to circumvent this issue (e.g. boost), GCC and CUDA are newer than supported by PIConGPU.
We assume this is the reason why pic-build fails.
Do you know how to fix this or do you have by chance a working PIConGPU version for the new module versions?
Thank you very much.
@Anton-Le Could you please comment on that. You are currently the one running on JUWELS.
@TheresaBruemmer with reference to your email: In an offline discussion, @Anton-Le just told me that he had no issues with the "newer" boost version 1.74.0.
pic-build seems to run smoothly until 85%. Then, I get:
[ 85%] Built target cuda_memtest
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:33: error: '__T0' was not declared in this scope; did you mean '__y0'?
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^~~~
| __y0
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:47: error: template argument 1 is invalid
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:33: error: '__T0' was not declared in this scope; did you mean '__y0'?
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^~~~
| __y0
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:47: error: template argument 1 is invalid
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:33: error: '__T0' was not declared in this scope; did you mean '__y0'?
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^~~~
| __y0
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:47: error: template argument 1 is invalid
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:33: error: '__T0' was not declared in this scope; did you mean '__y0'?
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^~~~
| __y0
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:47: error: template argument 1 is invalid
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:33: error: '__T0' was not declared in this scope; did you mean '__y0'?
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^~~~
| __y0
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:47: error: template argument 1 is invalid
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^
CMake Error at cupla_generated_stream.cpp.o.Release.cmake:280 (message):
Error generating file
/p/home/jusers/bruemmer1/juwels/picInputs/myLWFA/.build/CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/./cupla_generated_stream.cpp.o
make[2]: *** [CMakeFiles/cupla.dir/build.make:4841: CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/cupla_generated_stream.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMake Error at cupla_generated_Driver.cpp.o.Release.cmake:280 (message):
Error generating file
/p/home/jusers/bruemmer1/juwels/picInputs/myLWFA/.build/CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/manager/./cupla_generated_Driver.cpp.o
make[2]: *** [CMakeFiles/cupla.dir/build.make:3253: CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/manager/cupla_generated_Driver.cpp.o] Error 1
CMake Error at cupla_generated_event.cpp.o.Release.cmake:280 (message):
Error generating file
/p/home/jusers/bruemmer1/juwels/picInputs/myLWFA/.build/CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/./cupla_generated_event.cpp.o
make[2]: *** [CMakeFiles/cupla.dir/build.make:2459: CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/cupla_generated_event.cpp.o] Error 1
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:33: error: '__T0' was not declared in this scope; did you mean '__y0'?
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^~~~
| __y0
/p/home/jusers/bruemmer1/juwels/src/picongpu/thirdParty/alpaka/include/alpaka/core/Concepts.hpp:81:47: error: template argument 1 is invalid
81 | static_assert(std::is_base_of<type, TDerived>::value, "The type implementing the concept has to be a publicly accessible base class!");
| ^
CMake Error at cupla_generated_common.cpp.o.Release.cmake:280 (message):
Error generating file
/p/home/jusers/bruemmer1/juwels/picInputs/myLWFA/.build/CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/./cupla_generated_common.cpp.o
make[2]: *** [CMakeFiles/cupla.dir/build.make:871: CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/cupla_generated_common.cpp.o] Error 1
CMake Error at cupla_generated_memory.cpp.o.Release.cmake:280 (message):
Error generating file
/p/home/jusers/bruemmer1/juwels/picInputs/myLWFA/.build/CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/./cupla_generated_memory.cpp.o
make[2]: *** [CMakeFiles/cupla.dir/build.make:4047: CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/cupla_generated_memory.cpp.o] Error 1
CMake Error at cupla_generated_device.cpp.o.Release.cmake:280 (message):
Error generating file
/p/home/jusers/bruemmer1/juwels/picInputs/myLWFA/.build/CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/./cupla_generated_device.cpp.o
make[2]: *** [CMakeFiles/cupla.dir/build.make:1665: CMakeFiles/cupla.dir/__/__/thirdParty/cupla/src/cupla_generated_device.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:167: CMakeFiles/cupla.dir/all] Error 2
make: *** [Makefile:149: all] Error 2
ERROR: Could not successfully run make install in build directory:
.build
@TheresaBruemmer I have never seen such a compile error. @psychocoderHPC, @sbastrakov and @Anton-Le, have you ever encountered something like this (in Juelich)?
As a recap: Please be aware that CUDA/11.0 and Boost/1.74.0 are used here.
From the error log, this is an alpaka error, not a PIConGPU one. I have not encountered it before; however, gcc 9.3 is not officially supported by alpaka (nor by PIConGPU).
@TheresaBruemmer I have quickly tested on our dev system, which seems to have the same software versions. The 0.5.0 release (same as branch master) indeed does not build because of alpaka. However, the dev branch works for me, could you try it?
@sbastrakov and @PrometheusPi Great, thank you. I pulled the dev again and now it works!
@sbastrakov Thanks for testing 0.5.0 and dev