Hi Gleb,
I'm Chengyi. Thank you for developing this magnificent project. It is really helpful for studying FDTD and its parallelization. But I have encountered some problems when using MPI/GPU computing. When I run:
./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt
--use-cuda
--cuda-buffer-size 1
--cuda-gpus 0
--num-cuda-threads-x 4
--num-cuda-threads-y 4
--num-cuda-threads-z 4
the program only shows the log "Loading command line from file ./Examples/vacuum3D_test.txt" and keeps waiting until I kill it. I'm wondering whether there are some configurations I didn't set correctly?
By the way, these are my cmake flags in case you need them:
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DVALUE_TYPE=f -DPRINT_MESSAGE=ON -DCUDA_ENABLED=ON -DCUDA_ARCH_SM_TYPE=sm_60 -DCXX11_ENABLED=ON -DPARALLEL_GRID=ON -DPARALLEL_GRID_DIMENSION=3
If you need more simulation details, please let me know. Thank you very much.
With many thanks and wishes,
Chengyi
The full log is printed at the end of execution in RelWithDebInfo mode in case of Cuda launches.
Small tip: it is faster (in terms of both compilation and execution) to specify -DSOLVER_DIM_MODES if you know exactly which modes you will use. By default all modes are compiled into the binary, which significantly increases compilation time for Cuda builds. In your case -DSOLVER_DIM_MODES=DIM3 would be enough.
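For instance, a configure line with this flag added could look like the following (a sketch based on the flags already quoted in this thread; adjust paths and flags to your setup):

```shell
# Same flags as the original configure line, plus -DSOLVER_DIM_MODES=DIM3
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DVALUE_TYPE=f -DPRINT_MESSAGE=ON \
  -DCUDA_ENABLED=ON -DCUDA_ARCH_SM_TYPE=sm_60 -DCXX11_ENABLED=ON \
  -DPARALLEL_GRID=ON -DPARALLEL_GRID_DIMENSION=3 -DSOLVER_DIM_MODES=DIM3
```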
However, on systems with shared memory, share operations are not required at all (except for thread synchronization). That is why OpenMP is far more applicable there than MPI, and MPI-based programs won't show the best speedup. Unfortunately, OpenMP is not yet supported in fdtd3d.
With all this in mind, there are still things to tweak in fdtd3d.
First, you can build with -DPARALLEL_BUFFER_DIMENSION=xyz, which will divide the grid into chunks along all dimensions. Check the fdtd3d output, because it advises the optimal virtual topology. Second, you can increase --buffer-size B, and the share operation will then be performed only every B steps. In this case the virtual topology advised by fdtd3d is not guaranteed to be optimal, but you can set the topology manually with --manual-topology --topology-sizex X --topology-sizey Y --topology-sizez Z. Note that when the number of processes is not a divisor of the overall grid size, the virtual topology advised by fdtd3d is also not guaranteed to be optimal.
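As an illustration (the process count and topology sizes here are made up; the product of the three topology sizes must match the number of MPI processes), a manual-topology launch could look like:

```shell
# Illustrative: 8 MPI processes arranged as a 2x2x2 virtual topology,
# with boundary data shared every 2 steps
mpiexec -n 8 ./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt \
  --buffer-size 2 \
  --manual-topology --topology-sizex 2 --topology-sizey 2 --topology-sizez 2
```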
Thank you for the timely reply. I rebuilt fdtd3d with:
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DVALUE_TYPE=f -DPRINT_MESSAGE=ON -DCUDA_ENABLED=ON -DCXX11_ENABLED=ON -DPARALLEL_GRID=ON -DPARALLEL_GRID_DIMENSION=3 -DSOLVER_DIM_MODES=DIM3 -DPARALLEL_BUFFER_DIMENSION=x
and it worked on my workstation. (BTW, my workstation has four Tesla P100s, whose arch should be sm_60.) However, when I run:
mpiexec --mca btl ^openib -n 2 ./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt
--use-cuda
--cuda-buffer-size 2
--buffer-size 2
--cuda-gpus 0,1
--num-cuda-threads-x 4
--num-cuda-threads-y 4
--num-cuda-threads-z 4
an error occurs, as shown:
Calculating time step 0...
Calculating time step 1...
Fatal error: an illegal memory access was encountered at /home/t00540502/fdtd3d/Source/Scheme/InternalScheme.inc.h:912
*** FAILED - ABORTING
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
It seems the first step was fine, but something went wrong with the call to InternalSchemeKernelHelpers::calculateFieldStepIterationKernel in the second step. I wonder whether I've made some mistakes in using MPI+CUDA. Could you please tell me how to call it correctly?
Thanks a lot.
Best
Maybe something is wrong with the sm_60 arch; I have not tested it. The default arch is sm_20, so it should work on your cards. However, I have sometimes seen an illegal memory access when the Cuda arch did not match the GPU's compute capability.
Multi-gpu computations have limited applicability. If all data fits in memory of a single GPU, then it would be much faster to perform computations on this single GPU on a single computational node (because there will be no intermediate data sharing between CPU/GPU and between different computational nodes).
But when grids are very large and can't fit in the memory of a single computational node, there is no choice but to use multiple computational nodes, each possibly having a GPU. In this case CPU-GPU and CPU-CPU data sharing will be performed every B steps (by default B=1). This is much slower than plain computation of all time steps on a single GPU.
Multi-GPU computations are not yet fully supported in fdtd3d, because currently fdtd3d relies on the user to make sure that all data fits where it should (just as in CPU-only mode). So, with such limitations, it should work.
@solotcy There was a bug with unset arch (see #140). Please, check with sm_60 on that PR.
The illegal memory access is related to access to the global variable cudaSolverSettings, which is located in device memory (see INTERNAL_SCHEME_BASE<Type, TCoord, layout_type>::calculateFieldStepIteration):
if (SOLVER_SETTINGS.getDoUseTFSF ())
On 2 of the 3 GPUs with the same compute capability sm_35, on which I have tested fdtd3d, everything works fine (all the GPUs are different models). However, on one of them cudaSolverSettings for some reason becomes NULL when entering the getDoUseTFSF method (i.e. the this pointer is NULL). I have not been able to understand why this happens, but from what I've found it may be caused by a device malfunction.
It looks like you have been able to successfully launch fdtd3d on at least one of your 4 GPUs. Try GPU-only mode on each GPU separately. Since all your 4 GPUs are exactly the same model, there should be no difference at all in fdtd3d's behavior. If there is a difference, then a device malfunction becomes the more probable cause.
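One way to run that per-GPU check is a simple loop (a sketch assuming the build and example file from the messages above; the thread counts are just the values already used in this thread):

```shell
# Illustrative: run the same single-GPU test on each of the 4 GPUs in turn
for gpu in 0 1 2 3; do
  echo "=== testing GPU $gpu ==="
  ./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt \
    --use-cuda --cuda-gpus $gpu \
    --num-cuda-threads-x 4 --num-cuda-threads-y 4 --num-cuda-threads-z 4
done
```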
Thanks a lot for the replies.
I've checked PR #140 and rebuilt fdtd3d with the flag -DCUDA_ARCH_SM_TYPE=sm_60 added. Unfortunately, the same problem happens. However, as you mentioned, I encountered the same error when switching between different GPUs in single-GPU mode. And it fails not at the first step but at the second, as before:
Estimated current size: 1437644553 byte.
Setup blocks:
blockCount:
Coord (X : 1.000000, Y : 1.000000, Z : 1.000000).
blockSize:
Coord (X : 200.000000, Y : 200.000000, Z : 200.000000).
Calculating time step 0...
Calculating time step 1...
Fatal error: an illegal memory access was encountered at ~/fdtd3d/Source/Scheme/InternalScheme.inc.h:912
Things get somewhat strange here, as it only worked on the first GPU. On the other three it failed with the same error info.
Thanks for your tests! I was finally able to figure out the root cause of this problem. PR #141 solves the issue. Now fdtd3d should work on all your GPUs. Multi-GPU mode seems to work now too.
Thanks for your replies and the modified code. I've been able to run the program on my workstation with as many GPUs as I want. Now I can test the scalability on both CPUs and GPUs.
It's pretty cool, thanks!!!
Feel free to reopen this issue if you have more questions.