Hi there,
Does mixed precision benefit GTX cards, or is it only beneficial for RTX cards?
In general, you do need tensor cores to benefit. Some non-Volta cards (like the P100) can benefit from half-precision arithmetic for certain networks, but the numerical stability is much less reliable (even with Apex tools) and the P100 is not a GTX in any case.
What GTX card do you have exactly? Typically the GTX line is not built for half precision throughput. For example, if you have a 1080Ti, the half precision throughput is quite poor (although the FP32 throughput is quite good).
According to Phoronix, the performance of the 1080 Ti and the 2070 at fp16 is very close.
So you mean that when using mixed-precision training, the 2070 would outperform the 1080 Ti at fp16?
But how big is the difference between those two cards in mixed-precision training?
Those results from phoronix are surprising to me. 1080 Ti is compute capability 6.1, which is known to have terrible FP16 throughput relative to other compute capabilities. See the row "16-bit floating-point add, multiply, multiply-add" for compute capability 6.1 here.
In mixed training, the idea is that most of your operations are still carried out in FP16, while only a few operations are done in FP32 for safety. So yes, I would expect the 2070 to greatly outperform the 1080 on FP16 training.
I'm not quite sure what Phoronix measured, but read the following from Anandtech:
"GeForce GTX 1080, on the other hand, is not faster at FP16. In fact it's downright slow. [...] GTX 1080's FP16 instruction rate is 1/128th its FP32 instruction rate, or after you factor in vec2 packing, the resulting theoretical performance (in FLOPs) is 1/64th the FP32 rate, or about 138 GFLOPs."
For Pascal, Anandtech is correct: the GTX 10 series was not meant to execute fp16 faster than fp32; fp16 support was there only to allow numerical experimentation. Among the Pascal offerings, only the Tesla (P100) and Tegra (TX2) GPUs actually provided a 2x speed-up.
You can also check the CUDA developer guide. It says here that for compute capability 6.0 (Tesla P100), the fused multiply-add throughput in fp16 is 128 per clock per SM, compared to 64 in fp32. For compute capability 6.1 (GTX series) the story is altogether different: fp32 gives you 128 FMAs per clock per SM, compared to only 2 FMAs per clock per SM for fp16. For compute capability 6.2, you get the fp16 performance back up, but CC 6.2 is Tegra X2, not GeForce. I recommend checking the vendor's specs in the future rather than just third-party reviews.
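The per-SM numbers quoted above can be cross-checked against the Anandtech figure with a few lines of Python. This is only a back-of-the-envelope sketch: the throughput table is copied from the comment above, while the GTX 1080 core count and boost clock are reference specs assumed here, not taken from this thread.

```python
from fractions import Fraction

# FMA results per clock per SM, as quoted in the comment above.
FMA_PER_CLOCK_PER_SM = {
    "6.0": {"fp16": 128, "fp32": 64},   # Tesla P100
    "6.1": {"fp16": 2, "fp32": 128},    # GTX 10 series
}

def fp16_to_fp32_ratio(cc):
    """Ratio of fp16 to fp32 FMA throughput for a compute capability."""
    rates = FMA_PER_CLOCK_PER_SM[cc]
    return Fraction(rates["fp16"], rates["fp32"])

# Assumption (not from this thread): GTX 1080 reference specs.
GTX_1080_CUDA_CORES = 2560
GTX_1080_BOOST_GHZ = 1.733

def peak_gflops(cores, clock_ghz, ratio=Fraction(1)):
    """Theoretical peak, counting one FMA as two floating-point operations."""
    return float(2 * cores * clock_ghz * ratio)

if __name__ == "__main__":
    for cc in FMA_PER_CLOCK_PER_SM:
        print(f"CC {cc}: fp16/fp32 throughput ratio = {fp16_to_fp32_ratio(cc)}")
    fp32_peak = peak_gflops(GTX_1080_CUDA_CORES, GTX_1080_BOOST_GHZ)
    fp16_peak = peak_gflops(GTX_1080_CUDA_CORES, GTX_1080_BOOST_GHZ,
                            fp16_to_fp32_ratio("6.1"))
    print(f"GTX 1080 fp32 peak: {fp32_peak:.0f} GFLOPs")
    print(f"GTX 1080 fp16 peak: {fp16_peak:.1f} GFLOPs")
```

Under those assumed specs, the 2-versus-128 ratio for CC 6.1 works out to the same 1/64 that Anandtech reports, and the resulting fp16 peak lands in the neighborhood of their "about 138 GFLOPs".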
As @mcarilli mentioned, the fp16 numerics of Pascal are also different from Volta's and Turing's. Tensor cores present in Tesla V100, Titan V and 2070/2080/2080Ti do accumulation in fp32 for fp16 fused multiply-adds, which results in much better precision. Training in mixed precision already requires some effort (see here and here), which Apex fortunately automates (mixed-precision optimizer with dynamic loss scaling and amp subprojects). Dealing with additional precision issues due to fp16 accumulation in Pascal makes convergence harder than on Volta and Turing. It's possible, but it's harder, and as I mentioned, fp16 is slower than fp32 on GTX 10 series.
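To get a feel for why the accumulator precision matters, here is a small CPU-only sketch that emulates the two cases by rounding a running sum to fp16 or to fp32 after every addition. It uses Python's `struct` module for the rounding and is only an illustration of the numerics; it is not how any GPU actually executes these operations, and the helper names are made up for this example.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to the given struct format ('<e' = fp16, '<f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(values, acc_fmt):
    """Sum values, rounding the running total to acc_fmt after every add."""
    total = 0.0
    for v in values:
        total = round_to(acc_fmt, total + v)
    return total

# 10,000 identical fp16 inputs of roughly 0.001; the exact sum is about 10.
term = round_to("<e", 0.001)
values = [term] * 10_000

fp16_sum = accumulate(values, "<e")  # fp16 inputs, fp16 accumulator
fp32_sum = accumulate(values, "<f")  # fp16 inputs, fp32 accumulator

if __name__ == "__main__":
    print("fp16 accumulator:", fp16_sum)
    print("fp32 accumulator:", fp32_sum)
```

With the fp16 accumulator the running sum stalls well short of the true total, because once the sum is large enough each new term is smaller than half the spacing between adjacent fp16 values and gets rounded away. The fp32 accumulator stays close to the exact result.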
I'm reading this conversation with interest, and I'm starting to think that the bottleneck Nvidia allegedly put on Pascal consumer cards does not actually exist. Why? A lot of people (including myself) found that GTX Pascal cards run some 10-15% faster in fp16 than in fp32.
Look at these links:
https://alisha17.github.io/machine-learning/2017/12/15/benchmarks.html
And myself:
Note that at least in my case the card was actually running in fp16, since memory usage was almost halved compared to fp32. Note the slightly slower convergence too.
Can you explain this? Either we are making some fundamental mistake, or Nvidia is lying about "1/32" in GTXs.
It is possible to run mixed precision training on Pascal cards, and it may provide a modest speed-up if your network is heavy on bandwidth-bound operations, because you would be reading and writing less data. It would definitely provide the memory benefits that you are observing. However, at least if you are using PyTorch, when running compute-heavy operations on Pascal cards, be it consumer cards (1080) or server cards (P100), the math will actually be done in fp32, so the "1/32" fp16 throughput (that is the GTX throughput for real fp16 math, not fp32) won't affect you.
Thanks for your reply. But if the actual math is done in fp32 like you said, the tensor entries have to be stored using 32 bits each. Then, how can one observe such memory benefits? (I'm using PyTorch.)
Furthermore, the speedup observed on Pascal by such reviewers was a bit more than modest: it was almost on par with RTX cards (some 15% vs 20-25%).
From a bandwidth perspective, what matters is the traffic that needs to move across the DRAM bus from global memory to the compute cores. For FP16 tensors, this traffic is FP16. Once the data reaches the cores, it is stored in registers as FP32, operated on in FP32, and written back to DRAM once again as FP16. Register access is basically instant and not at all a "bandwidth" concern.
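As a rough illustration of that point, the sketch below counts the bytes that cross the DRAM bus for a simple elementwise operation (one read and one write per element) when the tensor is stored in fp16 versus fp32. The tensor size and the function name are hypothetical; the compute precision inside the registers does not appear in the calculation at all, which is the point.

```python
BYTES_PER_ELEMENT = {"fp16": 2, "fp32": 4}

def dram_traffic_bytes(num_elements, storage_dtype, reads=1, writes=1):
    """Bytes moved across the DRAM bus; depends only on the storage dtype."""
    return num_elements * BYTES_PER_ELEMENT[storage_dtype] * (reads + writes)

# Hypothetical tensor of 64M elements, read once and written once.
n = 64 * 1024 * 1024
fp32_traffic = dram_traffic_bytes(n, "fp32")
fp16_traffic = dram_traffic_bytes(n, "fp16")

if __name__ == "__main__":
    print(f"fp32 storage: {fp32_traffic / 2**20:.0f} MiB moved")
    print(f"fp16 storage: {fp16_traffic / 2**20:.0f} MiB moved")
```

For a bandwidth-bound operation, halving the bytes moved is where both the memory savings and any speed-up come from, even if the arithmetic itself runs in fp32.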