Fdtd3d: Questions in parallel computing

Created on 8 Jan 2020 · 9 comments · Source: zer011b/fdtd3d

Hi Gleb,

I'm Chengyi. Thank you for developing this magnificent project. It is really helpful for studying FDTD and its concurrency. But I encountered some problems in using the MPI/GPU computing.

  1. The problem is that when I run fdtd3d with the following command,
    ./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt
    where "vacuum3D_test.txt" was created from "vacuum3D.txt" by adding the following options
--use-cuda
--cuda-buffer-size 1
--cuda-gpus 0
--num-cuda-threads-x 4
--num-cuda-threads-y 4
--num-cuda-threads-z 4

the program only prints the log "Loading command line from file ./Examples/vacuum3D_test.txt" and then keeps waiting until I kill it. I'm just wondering if there are some configurations I didn't set correctly?
By the way, these are my cmake flags in case you need them:
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DVALUE_TYPE=f -DPRINT_MESSAGE=ON -DCUDA_ENABLED=ON -DCUDA_ARCH_SM_TYPE=sm_60 -DCXX11_ENABLED=ON -DPARALLEL_GRID=ON -DPARALLEL_GRID_DIMENSION=3

  2. Also, when I simulated "vacuum3D.txt" with MPI, the scalability shown by fdtd3d was not ideal. For example, the grid size is 40 × 40 × 40, which equals 64,000 cells, and I have one chip with 18 cores. With 1 process the run takes 67.74 seconds, and around 11.34 seconds with 8 processes, so the speedup is about 6. When I use more processes, say 18, the time reduction is trivial, from 11.34 s to about 9.6 s. Is this reasonable? Are there simulation configurations that can tune the parallel computing performance?

If you need more simulation details, please let me know. Thank you very much.

With many thanks and wishes,
Chengyi

Question

All 9 comments

  1. I don't think it is stuck; it just performs the computations, which are much slower even in RelWithDebInfo mode. Besides, the full log is printed at the end of execution in the case of Cuda launches.

Small tip: it is faster (in terms of compilation and execution) to specify -DSOLVER_DIM_MODES if you know exactly which modes you will use. By default all modes are compiled into the binary, which significantly increases compilation time in the case of Cuda builds. In your case -DSOLVER_DIM_MODES=DIM3 would be enough.
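For example, your cmake line from above with only the 3D mode compiled in would look like this:

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DVALUE_TYPE=f -DPRINT_MESSAGE=ON -DCUDA_ENABLED=ON -DCUDA_ARCH_SM_TYPE=sm_60 -DCXX11_ENABLED=ON -DPARALLEL_GRID=ON -DPARALLEL_GRID_DIMENSION=3 -DSOLVER_DIM_MODES=DIM3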

  2. There are a few things to keep in mind here. First of all, the overall execution time of each time step is the sum of computation time and share time. When the grid is relatively small, as in your case, share time can be significant and a careful choice of virtual topology is required (fdtd3d shows the best virtual topology for the specified grid size in its output).

However, on shared-memory systems share operations are not required at all (except for thread synchronization). That's why OpenMP is far more suitable here than MPI, and MPI-based programs won't show the best speedup. Unfortunately, OpenMP is not yet supported in fdtd3d.

With all this in mind, there are still things to tweak in fdtd3d.

  • By default only the Ox axis is split between computational nodes, but you can change this with -DPARALLEL_BUFFER_DIMENSION=xyz, which will divide the grid into chunks along all dimensions. Check the fdtd3d output, because it advises the optimal virtual topology.
  • The buffer size can be set with --buffer-size B, and the share operation will then be performed only every B steps. In this case the virtual topology advised by fdtd3d is not guaranteed to be optimal, but you can set the virtual topology manually with --manual-topology --topology-sizex X --topology-sizey Y --topology-sizez Z (see the example after the note below).

Note that when the number of processes is not a divisor of the overall grid size, the virtual topology advised by fdtd3d is also not guaranteed to be optimal.
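For example (a purely illustrative setup, not a recommended topology): add the following options to a copy of ./Examples/vacuum3D.txt

--buffer-size 2
--manual-topology
--topology-sizex 2
--topology-sizey 2
--topology-sizez 2

and launch with a matching number of processes (2 x 2 x 2 = 8):

mpiexec -n 8 ./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt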

Thank you for this timely reply.

  1. As for single-GPU computing, I rebuilt fdtd3d with the following flags

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DVALUE_TYPE=f -DPRINT_MESSAGE=ON -DCUDA_ENABLED=ON -DCXX11_ENABLED=ON -DPARALLEL_GRID=ON -DPARALLEL_GRID_DIMENSION=3 -DSOLVER_DIM_MODES=DIM3 -DPARALLEL_BUFFER_DIMENSION=x

and it worked on my workstation. (BTW, my workstation has four Tesla P100s, whose arch should be sm_60.)

  2. When I try to enable multi-GPU computing with
    mpiexec --mca btl ^openib -n 2 ./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt
    and the CUDA options
--use-cuda
--cuda-buffer-size 2
--buffer-size 2
--cuda-gpus 0,1
--num-cuda-threads-x 4
--num-cuda-threads-y 4
--num-cuda-threads-z 4

an error occurs, as shown:

Calculating time step 0...
Calculating time step 1...
Fatal error: an illegal memory access was encountered at /home/t00540502/fdtd3d/Source/Scheme/InternalScheme.inc.h:912
*** FAILED - ABORTING
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.

It seems the first step was fine, but something went wrong in the call to InternalSchemeKernelHelpers::calculateFieldStepIterationKernel during the 2nd step. I suspect I have made some mistake in using MPI+CUDA. Could you please show me how to call it correctly?
Thanks a lot.

Best

Maybe something is wrong with the sm_60 arch; I have not tested it. The default arch is sm_20, so it should work on your cards. However, I have sometimes seen illegal memory accesses when the Cuda arch did not match the GPU's compute capability.

Multi-gpu computations have limited applicability. If all data fits in memory of a single GPU, then it would be much faster to perform computations on this single GPU on a single computational node (because there will be no intermediate data sharing between CPU/GPU and between different computational nodes).

But when grids are very large and cannot fit in the memory of a single computational node, there is no choice but to use multiple computational nodes, each one possibly having a GPU. In this case CPU-GPU and CPU-CPU data sharing is performed every B steps (B=1 by default). This is much slower than plain computation of all time steps on a single GPU.

Multi-gpu computations are not yet fully supported in fdtd3d, because currently fdtd3d relies on the user to make sure that all data fits where it should fit (just as in CPU-only mode). So, with these limitations in mind, it should work.

@solotcy There was a bug with unset arch (see #140). Please check with sm_60 on that PR.

The illegal memory access is related to access to the global variable cudaSolverSettings, which is located in device memory (see INTERNAL_SCHEME_BASE<Type, TCoord, layout_type>::calculateFieldStepIteration):

if (SOLVER_SETTINGS.getDoUseTFSF ())

On 2 of the 3 GPUs with the same compute capability sm_35 on which I have tested fdtd3d, everything works fine (all GPUs are different models). However, on one of them, for some reason, cudaSolverSettings becomes NULL when entering the getDoUseTFSF method (i.e. the this pointer is NULL). I have not been able to understand why this happens, but from what I've found it may be caused by a device malfunction.
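Here is a minimal standalone CUDA sketch (not fdtd3d's actual code; all names are illustrative) of the pattern involved: a settings object lives in device memory and a device-side global pointer to it is filled in from the host. If that copy never takes effect (for example, because the loaded binary does not match the GPU's arch), the kernel dereferences a NULL pointer and fails with exactly this kind of illegal memory access.

#include <cstdio>
#include <cuda_runtime.h>

struct Settings { bool useTFSF; };

__device__ Settings *d_settings = nullptr;   // device-side global, NULL until the host sets it

__global__ void stepKernel ()
{
  // Analogue of SOLVER_SETTINGS.getDoUseTFSF (): dereferencing a NULL device
  // pointer here produces "an illegal memory access was encountered".
  if (d_settings->useTFSF)
  {
    printf ("TF/SF is enabled\n");
  }
}

int main ()
{
  Settings host = { true };
  Settings *dev = nullptr;
  cudaMalloc (&dev, sizeof (Settings));
  cudaMemcpy (dev, &host, sizeof (Settings), cudaMemcpyHostToDevice);

  // Without this call d_settings stays NULL on the device and the kernel faults.
  cudaMemcpyToSymbol (d_settings, &dev, sizeof (Settings *));

  stepKernel<<<1, 1>>> ();
  return cudaDeviceSynchronize () == cudaSuccess ? 0 : 1;
}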

It looks like you have been able to successfully launch fdtd3d on at least one of your 4 GPUs. Try GPU-only mode on each GPU separately. Because all your 4 GPUs are exactly the same, there should be no difference at all in fdtd3d's behavior. If there is a difference, then a device malfunction becomes the more probable cause.
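For example, with a single process and no mpiexec, you can change the --cuda-gpus line in ./Examples/vacuum3D_test.txt to 0, 1, 2 and 3 in turn and rerun the same command each time:

./Release/Source/fdtd3d --cmd-from-file ./Examples/vacuum3D_test.txt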

Thanks a lot for the replies.

I've checked PR #140 and rebuilt fdtd3d with the flag -DCUDA_ARCH_SM_TYPE=sm_60 added. Unfortunately, the same problem happens.

However, as you mentioned, I encountered the same error when I switched between different GPUs in single-GPU mode. And it fails not at the first step but at the second, as before:

Estimated current size: 1437644553 byte.
Setup blocks:
blockCount:
Coord (X : 1.000000, Y : 1.000000, Z : 1.000000).
blockSize:
Coord (X : 200.000000, Y : 200.000000, Z : 200.000000).
Calculating time step 0...
Calculating time step 1...
Fatal error: an illegal memory access was encountered at ~/fdtd3d/Source/Scheme/InternalScheme.inc.h:912

And things become kind of strange, as it only worked on the first GPU. On the other three it failed with the same error info.

Thanks for your tests! I was finally able to figure out the root cause of this problem. PR #141 solves the issue. Now fdtd3d should work on all your GPUs. Multi-gpu mode seems to work now too.

Thanks for your replies and the modified code. I've been able to run the program on my workstation with as many GPUs as I want. Now I can test the scalability on both CPUs and GPUs.
It's pretty cool, thanks!!!

Feel free to reopen this issue if you have more questions.
