Hello!
I have just started learning how to run simulations with PIConGPU, so I keep running into problems when running the program. Here is one of them:
Following the documentation, I tried to build and run the LaserWakefield example. The build phase finished without any errors, but the output of the run command (tbg -s bash -c etc/picongpu/1.cfg -t etc/picongpu/bash/mpiexec.tpl $SCRATCH/runs/lwfa_001) was:
Running program...
==> Error: Spec '[email protected]%[email protected]~adios+hdf5~isaac+png backend=cuda cudacxx=nvcc arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+atomic+chrono~clanglibcpp~container~context~coroutine+date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout+wave cxxstd=11 visibility=hidden arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+shared arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~doc+ncurses+openssl+ownlibs~qt arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~cxx~debug~fortran~hl~java+mpi+pic+shared~szip~threadsafe api=none arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~ipo+mpi build_type=RelWithDebInfo patches=669608721dfce0ada7cef1ac84344352791a8916b7bb98ca8a0d4e6d4670e744 arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~python 
arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~symlinks+termlib arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94 arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~java~legacylaunchers~lustre~memchecker~pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=none schedulers=none arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+systemcerts arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~ipo build_type=RelWithDebInfo arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~pic arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+optimize+pic+shared arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+pic arch=linux-ubuntu16.04-haswell' matches no installed packages.
[11/24/2020 02:01:04][lightning2][0]:ERROR: CUDA error: CUDA driver version is insufficient for CUDA runtime version, line 279, file /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/picongpu-0.5.0-c235ttjtevzctwzmpe3clm5bskxizbxe/thirdParty/cuda_memtest/cuda_memtest.cu
cuda_memtest crash: see file /home/astashkin/runs/lwfa_001/simOutput/cuda_memtest_lightning2_0.err
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[58063,1],0]
Exit code: 1
--------------------------------------------------------------------------
Sorry if I have misunderstood something, and thank you in advance for your help!
All the best, Egor
PIConGPU was installed with spack; the OS is Ubuntu 16.04.
Hello @zimaaaaa , thanks for trying out PIConGPU.
The installation and launch steps look fine to me. I suspect the error appears because the GPU driver on your system does not support CUDA 10.2, which was used to compile. To investigate, you could check NVIDIA's documentation, or try running a sample CUDA application unrelated to our software stack (e.g. an example from their SDK); I suspect the output would be the same. If this is indeed the case, please update the driver. Otherwise, please write back and we will try to help investigate further.
To add: if the driver is indeed outdated but you cannot update it, you could also try installing PIConGPU with an earlier version of CUDA that is supported by your driver. To do so, you can use +cuda@[version] in your spack install picongpu command. That would create another spack installation of PIConGPU.
It is also possible to build and run PIConGPU without CUDA, and we could help with it, but I assume that's not your intention on a GPU-enabled system.
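The driver check suggested above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the minimum driver versions in the table are values recalled from NVIDIA's CUDA release notes, so please verify them against NVIDIA's current compatibility table before relying on them.

```python
# Minimum Linux driver version per CUDA toolkit release.
# ASSUMPTION: values recalled from NVIDIA's CUDA release notes; verify them
# against NVIDIA's compatibility table before relying on this mapping.
MIN_LINUX_DRIVER = {
    "9.2": "396.26",
    "10.0": "410.48",
    "10.1": "418.39",
    "10.2": "440.33",
}


def parse_version(text):
    """Turn a dotted version string such as '440.33' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(cuda_version, driver_version):
    """Return True if the driver is at least the minimum listed for this CUDA."""
    required = MIN_LINUX_DRIVER[cuda_version]
    return parse_version(driver_version) >= parse_version(required)


def newest_supported_cuda(driver_version):
    """Return the newest CUDA release in the table the driver can run, or None."""
    candidates = [v for v in MIN_LINUX_DRIVER if driver_supports(v, driver_version)]
    return max(candidates, key=parse_version) if candidates else None


if __name__ == "__main__":
    driver = "430.50"  # hypothetical driver version string, for illustration only
    for cuda in sorted(MIN_LINUX_DRIVER, key=parse_version):
        print(cuda, driver_supports(cuda, driver))
    print("newest usable CUDA in table:", newest_supported_cuda(driver))
```

The installed driver version can usually be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader and passed in as the driver_version string.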
My server runs Ubuntu 16.04, and the problem was indeed driver compatibility: on this OS I can install only the 430 driver, while CUDA 10.2 requires 440 or higher. So there is one issue left right now: spack cannot concretize the CUDA version in the install command.
If I try spack install picongpu%[email protected][email protected] (or another version), it gives me the following output:
Error: trying to set variant "cuda" in package "picongpu", but the package has no such variant [happened during concretization of [email protected]%[email protected]+cuda]
If I try to pin the picongpu version (to avoid [email protected] in the command), I get the following input-output pair:
IN: spack install [email protected]%[email protected][email protected]
OUT: Error: A spec cannot contain multiple version signifiers. Use a version list instead.
Sorry, my bad @zimaaaaa. You are right, our spack package does not offer such a setting. I think there are two ways of dealing with it.
First, you can tell spack to use the CUDA version already installed on your system. To do so, you just need to create a .yaml description file, as described in the spack documentation. I think this is the better option, as it has worked fine for some other users of PIConGPU + spack and simply re-uses the CUDA you already have. In this case, when installing picongpu you will see something like ==> cuda@[version] : externally installed in [your local path].
Alternatively, you could try modifying our spack settings manually to use a fitting CUDA version. To do so, please modify one of these lines, according to your PIConGPU version, in your local clone of this repo. I am not sure whether spack picks this up automatically or whether you would need to do spack repo remove and then spack repo add.
Thanks for your help! I installed picongpu from spack using your second approach (changing the CUDA parameters in packages.py), and now the LWFA example runs correctly. So this alternative is a good method too!
However, I then tried to run some other examples, such as Transition Radiation, and ran into trouble with the Bunch example. When I run it the usual way, tbg -s bash -c etc/picongpu/32.cfg -t etc/picongpu/bash/mpiexec.tpl $SCRATCH/runs/bunch_001 -f, the program gives me the following message:
Warning: using existing folder on user-request [-f]
Running program...
==> Error: Spec '[email protected]%[email protected]~adios+hdf5~isaac+png backend=cuda cudacxx=nvcc arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+atomic+chrono~clanglibcpp~container~context~coroutine+date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout+wave cxxstd=11 visibility=hidden arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+shared arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~doc+ncurses+openssl+ownlibs~qt arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~cxx~debug~fortran~hl~java+mpi+pic+shared~szip~threadsafe api=none arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~ipo+mpi build_type=RelWithDebInfo patches=669608721dfce0ada7cef1ac84344352791a8916b7bb98ca8a0d4e6d4670e744 arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~python 
arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~symlinks+termlib arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94 arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~java~legacylaunchers~lustre~memchecker~pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=none schedulers=none arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+systemcerts arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~ipo build_type=RelWithDebInfo arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected] arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]~pic arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+optimize+pic+shared arch=linux-ubuntu16.04-haswell ^[email protected]%[email protected]+pic arch=linux-ubuntu16.04-haswell' matches no installed packages.
--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:
App: /home/astashkin/runs/bunch_001/input/bin/cuda_memtest.sh
Number of procs: 32
PPR: 8:node
Please revise the conflict and try again.
--------------------------------------------------------------------------
I also found some interesting details in the output of pic-build:
Can NOT find 'adios_config' - set ADIOS_ROOT, ADIOS_DIR or INSTALL_PREFIX, or check your PATH
-- Could NOT find ADIOS (missing: ADIOS_LIBRARIES ADIOS_INCLUDE_DIRS) (Required is at least version "1.13.1")
-- Found HDF5: /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/hdf5-1.10.7-7xluivzfjcklfuioyhrdpssjtyrqsfld/lib/libhdf5.so;/home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/zlib-1.2.11-5q6bhs5vfgvppoadvm3jczuxb255wei3/lib/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.10.7")
-- Found Splash: /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/libsplash-1.7.0-z4l73ji6q4ekl3twro2u7kzmecwyu3zu/lib/cmake/Splash
-- Found PNG: /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/libpng-1.6.37-3fiqfh5dbhr7nqsejqhboifpl3hwfsaw/lib/libpng.so (found version "1.6.37")
-- Found Freetype: /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/freetype-2.10.1-jsq3hqsjl3g3vklfaki5x3jnf63vtwbx/lib/libfreetype.so (found version "2.10.1")
-- Found PNGwriter: /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/pngwriter-0.7.0-izijelgfzxjuy4eam5f3wig43w547w45/lib/cmake/PNGwriter
-- Could NOT find ISAAC - set ISAAC_DIR or check your CMAKE_PREFIX_PATH
-- Found Boost: /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/boost-1.70.0-ssdzlrtiy5ezgja2epgpsxqqzaom44hj/lib/cmake/Boost-1.70.0/BoostConfig.cmake (found suitable version "1.70.0", minimum required is "1.65.1") found components: program_options
CMake Deprecation Warning at /home/astashkin/src/spack/opt/spack/linux-ubuntu16.04-haswell/gcc-7.3.0/picongpu-0.5.0-j2rhqyqd6q4xoow7gps6sc2gno5mp57p/thirdParty/cmake-modules/FindADIOS.cmake:91 (cmake_minimum_required):
Compatibility with CMake < 2.8.12 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
Call Stack (most recent call first):
CMakeLists.txt:242 (find_package)
Could you give me some tips on why this happens? Also, may I come to you with more difficult questions about your program? I am a Russian second-year physics student, and I am going to write my first semester paper with the help of your program.
All the best,
Egor
I think the cmake output is just a warning and is not related to your issue.
I am not sure which kind of system you are running on; the following assumes a workstation with e.g. 8 GPUs. If you are on a multi-node cluster, please try this suggestion nonetheless, as the result would help us investigate.
The issue is that you were trying to run 32 MPI ranks, while your system seems to support 8 per node. The number of MPI ranks you request is defined by the file passed as the -c argument of tbg. In all of our examples, we name those files according to the number of MPI ranks, and when using GPUs, 1 rank == 1 GPU. So I think a smaller example may work. The Bunch example is by nature very computationally demanding, so we only provide a 32.cfg. Since you already have it compiled, as a quick test that PIConGPU works at all, you could manually modify the .cfg file (etc/picongpu/32.cfg inside the directory created by pic-create). There you need to change these lines to reduce the number of ranks (== GPUs) in x, y, z; the total number of ranks is their product. Also reduce the grid size at line 39 roughly in proportion, otherwise the simulation will not fit into GPU memory. Changing just a .cfg does not require rebuilding; you can simply re-run the tbg command.
A smaller example would be LaserWakefield, which we use in the documentation; switching to it would require another pic-create + pic-build.
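The arithmetic behind this suggestion can be sketched as follows. The function names are hypothetical, and the device layout and grid size in the example are made-up illustration values, not the actual contents of 32.cfg. The sketch keeps the number of cells per GPU constant when the number of ranks is reduced, which is one reasonable reading of "reduce the grid size proportionally".

```python
from math import prod  # requires Python 3.8 or newer


def total_ranks(devices):
    """Number of MPI ranks (== GPUs): the product of the x, y, z device counts."""
    return prod(devices)


def cells_per_device(grid, devices):
    """Cells handled by one GPU along each axis; the grid must divide evenly."""
    for n_cells, n_dev in zip(grid, devices):
        if n_cells % n_dev != 0:
            raise ValueError(f"{n_cells} cells do not split evenly over {n_dev} devices")
    return tuple(n_cells // n_dev for n_cells, n_dev in zip(grid, devices))


def shrink_grid(grid, old_devices, new_devices):
    """Scale the global grid so each GPU keeps the same local number of cells."""
    local = cells_per_device(grid, old_devices)
    return tuple(c * d for c, d in zip(local, new_devices))


if __name__ == "__main__":
    # HYPOTHETICAL values for illustration; read the real ones from your .cfg.
    old_devices = (2, 8, 2)        # 2 * 8 * 2 = 32 ranks
    grid = (128, 3072, 128)
    new_devices = (1, 8, 1)        # 1 * 8 * 1 = 8 ranks, fits 8 GPUs per node
    print(total_ranks(old_devices), "->", total_ranks(new_devices), "ranks")
    print("new grid:", shrink_grid(grid, old_devices, new_devices))
```

The resulting device counts and grid size would then be entered by hand in the .cfg file before re-running tbg.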
OK, thank you very much for your advice! It really helped me a lot.
All the best,
Egor