Hi, any thoughts on the new sparsity features of the NVIDIA Ampere GPUs? Looks like they could give a big speed improvement where applicable.
Even without sparsity, YOLOv4 training will be 6x faster on Ampere, since TF32 is available on all Ampere GPUs (RTX 3070 - 3090, Tesla A100). And inference will be 2x faster.
And yes, it seems Sparsity can be used for pruning: just prune (set to zero) the smallest 2 of every 4 consecutive weight values, and it should speed up inference another 2x (so in total inference will be about 4x faster than on Turing).
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
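To make the 2:4 pattern concrete, here is a minimal NumPy sketch of what the pruning step does to a weight tensor (this is just for illustration, not NVIDIA's actual tooling; the function name `prune_2_of_4` is made up): in every group of 4 consecutive weights, the 2 smallest-magnitude values are zeroed.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every group of 4 consecutive weights.

    Assumes the total number of elements is a multiple of 4 (real code would
    pad or handle the remainder).
    """
    flat = weights.reshape(-1, 4)                      # groups of 4 consecutive values
    drop_idx = np.argsort(np.abs(flat), axis=1)[:, :2]  # 2 smallest-magnitude per group
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop_idx, False, axis=1)   # keep only the 2 largest
    return (flat * mask).reshape(weights.shape)

# Example: each group of 4 keeps its 2 largest-magnitude values
w = np.array([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01], dtype=np.float32)
print(prune_2_of_4(w))   # -> [ 0.9  0.   0.  -0.7  0.   0.3 -0.4  0. ]
```

The resulting 50% sparse weights follow exactly the 2:4 structure that the Ampere sparse Tensor Cores can exploit at inference time.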
NVIDIA has developed a simple and universal recipe for sparsifying deep neural networks for inference using this 2:4 structured sparsity pattern. The network is first trained using dense weights, then fine-grained structured pruning is applied, and finally the remaining non-zero weights are fine-tuned with additional training steps. This method results in virtually no loss in inferencing accuracy based on evaluation across dozens of networks spanning vision, object detection, segmentation, natural language modeling, and translation.
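The recipe itself is just: train dense, mask the weights to 2:4, then keep training with the mask held fixed. A rough PyTorch sketch of those three steps on a toy layer (the helper `make_2_4_mask` and the toy model are mine for illustration, not NVIDIA's ASP tooling):

```python
import torch
import torch.nn as nn

def make_2_4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Boolean mask keeping the 2 largest-magnitude values of each group of 4."""
    flat = weight.detach().abs().reshape(-1, 4)
    keep = flat.argsort(dim=1, descending=True)[:, :2]   # top-2 indices per group
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return mask.reshape(weight.shape)

# Toy stand-in for "a network trained using dense weights"
model = nn.Linear(16, 8)
data, target = torch.randn(32, 16), torch.randn(32, 8)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Step 1: dense training (abridged to one step here)
opt.zero_grad()
loss_fn(model(data), target).backward()
opt.step()

# Step 2: apply fine-grained 2:4 structured pruning
mask = make_2_4_mask(model.weight).float()
with torch.no_grad():
    model.weight.mul_(mask)

# Step 3: fine-tune the remaining non-zero weights, re-applying the mask
for _ in range(10):
    opt.zero_grad()
    loss_fn(model(data), target).backward()
    opt.step()
    with torch.no_grad():
        model.weight.mul_(mask)   # keep pruned positions at exactly zero
```

Re-applying the mask after each optimizer step is the simplest way to keep the pruned positions at zero while the surviving weights are fine-tuned; NVIDIA's own tooling handles this (and the mask export for TensorRT) for you.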

Cool. The fp16 performance doesn't seem to be that much better than a 2080 Ti (e.g. this ResNet benchmark), but maybe that will change when the TensorRT support comes out? I don't know.