**Hi,
GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
My CUDA version is the following:
Should I install the latest cudatoolkit 11.0? It seems PyTorch only provides cudatoolkit 10.2, as in the screenshot below.

Is there any solution for this issue?
How do I get sm_86 binary capability?
Thanks in advance**
cc @malfet @seemethere @walterddr @ngimel
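For anyone triaging this, a quick sanity check is to compare what the GPU reports with what the installed binary was actually built for. A minimal sketch, assuming a recent build (torch.cuda.get_arch_list() does not exist in older releases):

```python
import torch

# PyTorch version and the CUDA toolkit the binary was built against
print(torch.__version__, torch.version.cuda)
# Compute capability of the installed GPU, e.g. (8, 6) for an RTX 3080
print(torch.cuda.get_device_capability(0))
# cubin/PTX targets baked into this binary; sm_86 runs only if an sm_86 cubin
# or a PTX target it can JIT from (e.g. compute_80) is in this list
print(torch.cuda.get_arch_list())
```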
I have the same problem. Toooooo sad
I just installed cudatoolkit 11.0 and tried to install PyTorch 1.6.0, but only the CPU version of PyTorch is available.
conda install pytorch torchvision cudatoolkit=11.0 -c pytorch
=======> this doesn't work
Anyone else facing the same problem?
@swecomic It seems to work if you switch to the nightly builds, which also means it's the in-development 1.7.0, instead of the stable release (1.6.0).
conda install pytorch torchvision cudatoolkit=11 -c pytorch-nightly
I got RTX 3080 working on this configuration but I'm getting some stability issues. Training goes well for a few minutes and then hangs.
It also seems that NVIDIA maintains a docker image built with CUDA Toolkit 11.0. I have yet to test this.
cc @ptrblck for training hangs on 3080. @Sleepwalking do you have a script to reproduce the issue?
@ngimel Thanks for the attention. I find it hard to come up with a minimal example because it can take anywhere from 5 minutes to an hour to reproduce this issue, but I managed to get a trace of the callstack when it froze.
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5571c4a1eb7c) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5571c4a1eb80,
cond=0x5571c4a1eb50) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x5571c4a1eb50, mutex=0x5571c4a1eb80)
at pthread_cond_wait.c:638
#3 0x00007f418c8234cb in __gthread_cond_wait (__mutex=<optimized out>,
__cond=<optimized out>)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/x86_64-conda_cos6-linux-gnu/bits/gthr-default.h:878
#4 std::condition_variable::wait (this=<optimized out>, __lock=...)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#5 0x00007f417bcbaffb in torch::autograd::ReadyQueue::pop() ()
from /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#6 0x00007f417bcbffb9 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()
from /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#7 0x00007f417bcbe010 in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) ()
from /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#8 0x00007f4180b3d96c in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) () from /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#9 0x00007f417bcbd195 in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
from /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f4180b3d76e in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
from /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#11 0x00007f4180b3e91e in THPEngine_run_backward(THPEngine*, _object*, _object*) ()
from /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so
Full log is here.
The script on which I encountered this particular problem has been running for days on 1.6.0 with 20-series and 10-series cards without any issue. After upgrading to 1.7.0 + CUDA Toolkit 11 + driver 455.23, it runs fine on an RTX 2080 Ti but has this random freezing issue on the RTX 3080. I'm still working to find a minimal reproducible example.
@Sleepwalking
Thanks. I'm trying training now and it's working fine without any problem so far.
Please ignore my previous post. In the next few trials, the code always freezes during Tensor.item() calls, instead of the callstack posted above. So I might have attached to the wrong process. (Update: tried this a few more times and this time it indeed froze on futex_wait_cancelable as my previous post suggested. All threads are waiting and it appears to be a deadlock.)
#0 0x00007f3526fbadac in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007f3526e421bf in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007f3526efa345 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f3527076b06 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f3527077611 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f3526f6f066 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007f3526f85047 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007f3526f858bf in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007f352707bc03 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#9 0x00007f3526e2a9db in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007f3526e2b055 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#11 0x00007f3526e2cb02 in ?? () from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#12 0x00007f3526e9efde in cuMemcpyDtoHAsync_v2 ()
from target:/usr/lib/x86_64-linux-gnu/libcuda.so.1
#13 0x00007f368cb70819 in ?? () from target:/usr/local/cuda/lib64/libcudart.so.11.0
#14 0x00007f368cb4e02d in ?? () from target:/usr/local/cuda/lib64/libcudart.so.11.0
#15 0x00007f368cb89df6 in cudaMemcpyAsync ()
from target:/usr/local/cuda/lib64/libcudart.so.11.0
#16 0x00007f35e0e2c6ee in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007f35e0e2ecd7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so
#18 0x00007f35e21bf2b8 in at::CUDAType::_local_scalar_dense(at::Tensor const&) ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007f35e21f0b81 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007f365cbf02f5 in at::_local_scalar_dense(at::Tensor const&) ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007f365e04635c in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) () from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007f365cb5d9a1 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007f365cbf02f5 in at::_local_scalar_dense(at::Tensor const&) ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007f365c7d3c8b in at::native::item(at::Tensor const&) ()
from target:/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
https://pastebin.ubuntu.com/p/9JvhhXMzBD/
Here's a minimal example that reproduces this issue. Note that the chance of freezing is extremely low per iteration. You might need to wait for hours.
import torch
import torch.nn as nn
import torch.optim as optim

size = 128

class PayloadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            *[nn.Conv1d(size, size, 1, 1, 0) for i in range(10)])

    def forward(self, X):
        return self.layers(X)

device = "cuda:0"
model = PayloadModel().to(device)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
inputs = torch.randn(32, size, 256, device=device)
targets = torch.randn(32, size, 256, device=device)
loss = nn.MSELoss()

for step in range(10000000):
    predicted = model(inputs)
    L = loss(predicted, targets)
    optimizer.zero_grad()
    L.backward()
    optimizer.step()
    print("%d steps, loss = %f" % (step, L.item()), end="\r")
conda install pytorch torchvision cudatoolkit=11 -c pytorch-nightly
works very well for me!!!
And when I train my models on several 3080 GPUs, this version is much faster than my own build, because I did not enable NCCL when compiling PyTorch from source.
Thanks very much!!!
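As an aside, whether a given binary was built with NCCL can be checked directly. A minimal sketch, assuming a CUDA-enabled Linux build:

```python
import torch
import torch.distributed as dist

# True if this build ships NCCL support (needed for fast multi-GPU collectives)
print(dist.is_nccl_available())
# NCCL version the binary links against
print(torch.cuda.nccl.version())
```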
I am currently using Windows 10, Python 3.7, and CUDA 11, and installed PyTorch 1.7 using the following command:
_pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html_
The warning still exists, but the program runs normally and achieves the expected results on the RTX 3080 GPU.
Its computing performance is somewhat limited, though.
conda install pytorch torchvision cudatoolkit=11 -c pytorch-nightly
This works for me, while the one below doesn't:
I am currently using Windows 10, Python 3.7, CUDA 11, and installed PyTorch 1.7 using the following command:
_pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html_
The warning still exists, but the program can run normally and achieve the expected effect on the RTX 3080 GPU.
@Sleepwalking Re: https://github.com/pytorch/pytorch/issues/45028#issuecomment-695757514
I tried your script on an RTX 2080 Ti for 10+ hours and on an RTX 3090 for 15+ hours, and I cannot reproduce the freeze. Are you able to try it on a different device, or reinstall your driver/CUDA?
Hi, do you know when CUDA 11 will support sm_86?
@WangWenhao0716 Sorry, I cannot disclose the release date of the new CUDA, but please keep an eye on https://developer.nvidia.com/cuda-downloads
thanks~
@zasdfgbnm Thank you very much for testing.
I nuked the system partition and reinstalled Ubuntu 20.04, the 455.23.04 driver, and CUDA 11.0.3, and then compiled PyTorch from source (with TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0+PTX"). This time I tried 1.6.0 just to see whether it's a regression between 1.6.0 and the current master, but unfortunately the problem persisted.
Some other builds I tried but had the same problem include the NGC container and the 1.7.0 nightly build on conda.
A very weird thing I recently observed is that when one training script is running on the RTX 3080, there's a chance of freezing another unrelated training script running on a different GPU on the same machine. This is not distributed training, but completely separate experiments on different CUDA devices. I wonder if this is not a PyTorch issue but a driver issue?
@Sleepwalking I don't know. I need to look deeper. But it doesn't look like a PyTorch bug to me.
@Sleepwalking could you check if ECC mode is on and, if not, enable it using these instructions and rerun your script, please?
@ptrblck I meant downloading the NGC docker image and running it locally on RTX 3080, which does not support ECC. Sorry for the confusion.
Use Python 3.8 and install the latest preview version from today:
_https://download.pytorch.org/whl/nightly/cu110/torch-1.7.0.dev20200923%2Bcu110-cp38-cp38-win_amd64.whl_
The warning has disappeared.
Wow! Thank you very much for the update; I will try it later!
@WangWenhao0716 My experiments show that the performance of the latest version seems to have improved compared to the September 18th version.
Ok!
So sm_86 is basically a new architecture that even CUDA itself doesn't support yet. The nightly-built PyTorch uses '8.0+PTX' in the flags for forward compatibility.
I'm a little confused, since I thought CUDA itself IS forward-compatible without PTX?
For example, PyTorch 0.4.1 is compiled under CUDA 9.0 (sm_70), and the binary can run directly under a CUDA 10.1 (sm_75) installation?
@elmirador
Binary forward-compatibility is only guaranteed across minor versions, not across major upgrades. According to the NVIDIA documentation:
For example, a cubin generated for compute capability 7.0 is supported to run on a GPU with compute capability 7.5, however a cubin generated for compute capability 7.5 is not supported to run on a GPU with compute capability 7.0, and a cubin generated with compute capability 7.x is not supported to run on a GPU with compute capability 8.x.
https://docs.nvidia.com/cuda/ampere-compatibility-guide/index.html
@Sleepwalking
Got it! Thanks for the clarification!
CUDA 11.1 has been released with compute capability 8.6 support. Is the current master compatible with 11.1?
@realiti4 Yes, CUDA 11.1 supports 8.6, according to https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-general-new-features
Added support for NVIDIA Ampere GPU architecture based GA10x GPUs (compute capability 8.6), including the GeForce RTX-30 series.
Hey @Sleepwalking, long time no see! :)
What RTX 3080 card do you have? Founder's edition or something from other manufacturers? Just wondering if the freezing problem you're observing is related to this issue. People have been reporting this issue with cards from other manufacturers.
Also, it may not be a bad idea to check the stability of your card with gpu-burn. I've found this tool to be super effective for pinpointing GPU instability issues (I had cards that passed all other tests, but failed on this one in under 20 minutes). Might be worth a shot...
Hello @Maghoumi !
It's a bit unfortunate that I got a Zotac Trinity, but it survives running gpu-burn for an hour (gpu_burn -tc 3600).
I also recompiled (yet again) PyTorch 1.6.0 on the latest CUDA 11.1 and upgraded the driver to 455.23.05. Freezing is still observed.
I think you should not use 1.6.0; 1.7 is suitable.
I also ran into an issue building cpp extensions, which warned that the 8_6 gencode wasn't supported. Adding it directly to the list of supported gencodes in torch/utils/cpp_extension.py worked. Note that this was in the 1.7 nightly, which I otherwise had no issues with on a 3080.
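For reference, an alternative to editing cpp_extension.py by hand is to override the arch flags through the TORCH_CUDA_ARCH_LIST environment variable, which torch.utils.cpp_extension honors when building an extension. A hedged sketch (the extension name and sources below are hypothetical, and "8.0+PTX" is only a workaround that relies on PTX JIT until the toolchain knows about 8.6):

```python
import os

# Must be set before the extension is built; overrides the gencode list that
# torch.utils.cpp_extension would otherwise derive from its supported-arch table.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0+PTX"

from torch.utils.cpp_extension import load

# Hypothetical extension, for illustration only.
ext = load(name="my_ext", sources=["my_ext.cpp", "my_ext_kernel.cu"], verbose=True)
```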
torch-1.7.0.dev20200923%2Bcu110-cp38-cp38-win_amd64.whl
Hi. How do I install with whl file?
Have you tried pip install torch-1.7.0.dev20200923%2Bcu110-cp38-cp38-win_amd64.whl?
Hi, I just received my RTX 3090 and am experiencing very poor performance with the latest PyTorch 1.7 nightlies with CUDA 11.
I'm benchmarking from a Windows 10 laptop with:
- an RTX 2060 Mobile
- an RTX 3090 in an eGPU enclosure, connected via Thunderbolt 3 to the laptop
and using libtorch 1.7 nightly from today, from the CUDA 11.0 folder: https://download.pytorch.org/libtorch/nightly/cu110/libtorch-win-shared-with-deps-latest.zip
I then create a float32 tensor of size 32x16x512x512 and a conv2d with parameters in: 16, out: 16, kernel: 3, padding: 1.
I upload the tensor and conv2d to the target device and run it 1000 times.
If I unplug the eGPU, I only have the RTX 2060 Mobile; it takes 77ms.
If I plug in the eGPU, I then have 2 CUDA devices:
Running it on the RTX 2060 Mobile now takes 100ms.
Running it on the RTX 3090 takes 62ms.
So... there's something very odd happening here. I checked the Task Manager to be sure the data and computation were correctly dispatched to the target device.
Running standard 3D benchmarks (such as Boundary) shows the expected performance, with the RTX 3090 being 6x faster than the RTX 2060 Mobile. So there's nothing wrong with the hardware or the driver. But for some reason, the libtorch performance is terrible on my RTX 3090.
For reference, the RTX 2060 Mobile is 4.6 TFLOPS (CUDA cores, float32) while the RTX 3090 is 35.6 TFLOPS (CUDA cores, float32).
So I was expecting the 3090 to be 7-8x faster than the 2060M on this kind of computation. With libtorch I'm getting almost the same conv2d speed on both.
Any idea?
I have experienced the same poor performance. I think PyTorch cannot support sm_86 well.
I'm also getting sub-optimal performance on WSL2 / 3090.
Maybe it's because PyTorch nightlies currently use CUDA 11.0, which only supports GA100/sm80 (A100), while CUDA 11.1 is required for GA102/sm86 (RTX 3xxx) ? https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#title-new-features
If so, any way to patch this?
I guess the performance regression comes from cuDNN 8.0.3.33 (nightly binary), which doesn't ship with updated heuristics for the 30xx series. I'll rerun some more tests and check our internal perf numbers to verify it.
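For anyone comparing builds while this gets sorted out, it is worth confirming which CUDA and cuDNN the binary actually loaded, and cudnn.benchmark can be flipped on so cuDNN times candidate algorithms instead of relying only on its heuristics. A minimal sketch; whether it recovers the missing 30xx tuning is an open question:

```python
import torch

print(torch.version.cuda)              # CUDA runtime the wheel was built with, e.g. '11.0'
print(torch.backends.cudnn.version())  # loaded cuDNN, e.g. 8003 for 8.0.3

# Let cuDNN benchmark algorithms for the observed input shapes instead of
# picking purely from heuristics (helps only with fixed input sizes).
torch.backends.cudnn.benchmark = True
```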
Yes, I think that's true. Please wait for the official reply.
Thanks!
@ptrblck I just upgraded to the latest driver (456.71), and then replaced PyTorch's CUDA 11.0 dlls and cuDNN 8.0.3 dlls with CUDA 11.1 dlls and cuDNN 8.0.4 dlls downloaded from the NVIDIA Developer website.
Same test as in my first message ( https://github.com/pytorch/pytorch/issues/45028#issuecomment-706366727 )
2060 Mobile (no other GPU): 105ms
When plugging the 3090 eGPU to the laptop:
2060 Mobile: 130ms
3090 eGPU: 76ms
So... unfortunately no improvements here.
@divideconcept
Just wanted to share my results.
Setup: single 3090, Ubuntu 20.04, CUDA 11.1, cuDNN 8.0.4, Python 3.8.3 installed via miniconda3. I compiled PyTorch 1.8 from source ('1.8.0a0+b7261de') with TORCH_CUDA_ARCH_LIST=8.6. Verified correct CUDA and cuDNN versions.
Wrote a few lines to do what you tested:
test.py
import time
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

cudnn.benchmark = True
cuda = torch.device("cuda")

def sync():
    torch.cuda.synchronize()

def bench(f):
    sync()
    start = time.perf_counter()
    f()
    torch.cuda.synchronize()
    end = time.perf_counter()
    return end - start

x = torch.randn(32, 16, 512, 512, device=cuda)

class Test(nn.Module):
    def __init__(self):
        super().__init__()
        self.c = nn.Conv2d(16, 16, 3, padding=1)

    def forward(self, x):
        return self.c(x)

m = Test().to(cuda)

def run():
    o = m(x)
    return o

# warmup
for _ in range(10):
    diff = bench(run)
    print(diff)
print('warmup done')

# test
total = 0
for _ in range(1000):
    diff = bench(run)
    total += diff
print(f'average inference time (s) = {total / 1000}')
Results:
(base) bryan@ito:~/work/install_deps$ python test.py
1.1853430050000497
0.005603357999916625
0.0054980210002213425
0.005847926000114967
0.00610483899981773
0.006503064999833441
0.007494664999740053
0.0064374530002169195
0.005995182000333443
0.005487881999670208
warmup done
average inference time (s) = 0.005640469918992949
I tested your program with the PyTorch installed by "conda install pytorch torchvision cudatoolkit=11 -c pytorch-nightly" (CUDA 11.0).
The result is similar:
0.8016552379995119 0.005678674002410844 0.005604786005278584 0.005512381998414639 0.0055013609962770715 0.00549756800319301 0.005494060998898931 0.005671074002748355 0.005573111004196107 0.005666042001394089 warmup done average inference time (s) = 0.005726967847025663
I think the reasons may be:
First possible reason: PyTorch cannot support sm_86 well, whether it is compiled by ourselves or downloaded directly.
Second possible reason: sm_86 performance is similar to sm_80, and CUDA 11.1 performance is similar to CUDA 11.0.
But I do not think the second reason is plausible.
The test results on a V100:
0.006647378904744983 0.006867951946333051 0.006694809999316931 0.007057830924168229 0.006681458093225956 0.006523525109514594 0.00652472279034555 0.006677041063085198 0.00664644711650908 0.006601684028282762 warmup done average inference time (s) = 0.006438648683717474
For the V100, single-precision performance is 14 TFLOPS; for the 3090, it is 36 TFLOPS.
I do not think 0.006438648683717474 vs. 0.005640469918992949 is a reasonable ratio.
@bryanhpchiang Using the same script you shared, with PyTorch 1.8 installed on miniconda3 (Python 3.8.3/Windows 10) using
conda install pytorch torchvision cudatoolkit=11.0 -c pytorch-nightly
and then adding the cuDNN 8.0.4 DLLs to the Lib\site-packages\torch\lib folder, I get the following results:
Without the RTX 3090 plugged, only the RTX 2060M ("cuda"):
1.5258226000000001
0.025282400000000038
0.024967000000000183
0.024914400000000114
0.0246556
0.024653999999999954
0.024640899999999633
0.024686700000000172
0.024618300000000204
0.024677200000000177
warmup done
average inference time (s) = 0.02545905070000003
Plugging the RTX 3090 ("cuda:0"):
3.3976391
0.0057410999999998324
0.0058499000000002965
0.005876299999999723
0.005699100000000179
0.005888200000000232
0.006010500000000363
0.00652290000000022
0.005884700000000187
0.005650900000000014
warmup done
average inference time (s) = 0.0057204506000000325
So it's consistent with your results and the ones from @WangWenhao0716
Additionally, I got this strange result with "cuda:1" which seemed to use both the 3090 and 2060M:
1.4994307000000004
0.00035360000000039804
0.000288500000000802
0.00017170000000010788
0.00016589999999983007
0.0001648999999996903
0.00017079999999936035
0.00019450000000009737
0.00016449999999945675
0.0002036999999992517
warmup done
average inference time (s) = 0.0001310169000000121
I'm not sure why "cuda:1" did not target only the 2060M, or how it can get as low as 0.00013 s given the first two results. There's probably both a device-targeting and a sync issue here.
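One way to rule out the targeting part is to pin everything to an explicit device and synchronize that same device rather than the default one. A minimal sketch with illustrative device indices:

```python
import torch

dev = torch.device("cuda:1")
print(torch.cuda.get_device_name(dev))   # confirm which physical card cuda:1 is
x = torch.randn(32, 16, 512, 512, device=dev)
print(x.device)                          # should report cuda:1
# torch.cuda.synchronize() without an argument syncs the *current* device
# (cuda:0 by default); pass the device so the right GPU is waited on.
torch.cuda.synchronize(dev)
```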
But anyway, even though the 3090 vs. 2060M speedup is now more significant (x4.5), it's still not the expected x7.7 speedup.
As in @WangWenhao0716's comparison, it seems the RTX 3090 is about 2 times slower than it should be.
I then translated your python code to C++ (sort of):
#include <torch/torch.h>
#include <ATen/cuda/Exceptions.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda_runtime.h>

struct TestImpl : torch::nn::Module {
    TestImpl()
    {
        c = torch::nn::Conv2d(torch::nn::Conv2dOptions(16, 16, 3).padding(1));
        register_module("c", c);
    }
    torch::Tensor forward(const torch::Tensor& x) {
        return c(x);
    }
private:
    torch::nn::Conv2d c{nullptr};
};
TORCH_MODULE_IMPL(Test, TestImpl);

inline int64_t microseconds() { return double(std::chrono::system_clock::now().time_since_epoch() / std::chrono::microseconds(1)); }
inline double timediff(int64_t start, int64_t end) { return double(end-start)/1000000.; }

int main(int argc, char *argv[])
{
    auto cuda = std::string("cuda");
    auto x = torch::rand({32,16,512,512}, cuda);
    Test m;
    m->to(cuda);
    for(int i=0; i<10; i++)
    {
        int64_t start = microseconds();
        m(x);
        c10::cuda::CUDAStream stream = c10::cuda::getCurrentCUDAStream();
        AT_CUDA_CHECK(cudaStreamSynchronize(stream));
        int64_t end = microseconds();
        printf("%.8f \n", timediff(start, end));
    }
    printf("warmup done\n");
    int64_t start = microseconds();
    for(int i=0; i<1000; i++)
    {
        m(x);
        c10::cuda::CUDAStream stream = c10::cuda::getCurrentCUDAStream();
        AT_CUDA_CHECK(cudaStreamSynchronize(stream));
    }
    int64_t end = microseconds();
    printf("average inference time (s) = %.8f \n", timediff(start, end)/1000.);
    return 0;
}
...And by running tests and watching the Task Manager, I realized I did not get CUDA synchronization at all. The GPU is still heavily computing after the calls are done and times measured. Therefore all my libtorch C++ benchmarks were wrong.
What I could roughly observe by watching the Task Manager graphs, though, is that the RTX 3090 computed roughly 5x faster than the RTX 2060M, which seems on par with the Python results: fine, but again not fully as expected given the respective TFLOPS of the 2060M and 3090.
I'd really like to have proper CUDA sync in C++ for proper benchmarking, if anyone knows how to force it, as this did not work:
c10::cuda::CUDAStream stream = c10::cuda::getCurrentCUDAStream();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));
thanks
Hi, are there any comparisons between your 3080 and your old GPU?
Amazing! Can you compare your 3080/3090 with your older GPUs?
(In reply to Onur: "Are you using pytorch-nightly wheels by any chance? Because I think there is a problem with those. I just found out they are 9 times slower than my own build with CUDA 11.")
That might have been a quick conclusion, so I deleted it. A 9x difference seemed absurd; maybe there was an issue with my environment. I'll run more tests.
Is it faster than the [older] RTX cards? I tried building 1.6.0 directly with 11.1 + 8.0.4 and it didn't seem to work; will 1.7.0 work?
I've observed a 44% performance gain comparing the 3090 to a Titan RTX when benchmarking the CRAFT algorithm.
The 3090 was running on an old computer with 8GB RAM, a PCIe 2.0 x16 slot, and an i3-4130 CPU.
The Titan RTX was running on a computer with 64GB RAM, a PCIe 3.0 x16 slot (in x8 mode), and an i7-9700K CPU.
The GPU running time (plus some CPU time) on the Titan RTX compared with the 3090 was 144:100.
So the 3090 must be at least 44% faster than the Titan RTX.
Maybe you can remove the 2060M and test on the same machine?
Well, I guess the strange result might be that nothing was actually computed, or the work never reached that device (cuda:1) at all.
I tested @bryanhpchiang's benchmark script on an RTX 3080, RTX 2080 Ti, and RTX 2080. This time I upgraded to cuDNN 8.0.4 and PyTorch release/1.7, compiled from source. Just in case the performance is bottlenecked by data transfer, I also added a few more Conv2d layers. Here's a table of average inference times. From the results it looks like an adequate problem size is needed to reveal a performance difference. I also found the 3080 to be a little faster (10% to 20%) than the 2080 Ti under real-world workloads, although not as impressive as advertised.
self.c = nn.Sequential(
    nn.Conv2d(16, 32, 3, padding=1),
    nn.Conv2d(32, 64, 3, padding=1),
    nn.Conv2d(64, 64, 3, padding=1),
)
| Configuration | RTX 3080 | RTX 2080 Ti | RTX 2080 |
| --- | --- | --- | --- |
| Conv2d 16 -> 16 | 0.00667 | 0.00640 | 0.00859 |
| Conv2d 16 -> 32 -> 64 -> 64 | 0.06730 | 0.08771 | 0.11700 |
(Update)
After training for roughly 2 hours, 3080 hangs again. Very frustrating!
The 3080 is too bad currently...
Yeah, not great.
RTX 2080: 10.6 TFLOPS
RTX 2080 Ti: 13.45 TFLOPS
RTX 3080: 29.8 TFLOPS
You may try a larger input. The timing might be inaccurate due to imprecise system timing (it was around 50 ms on Windows); maybe make each measurement take longer than 1 second and then compare.
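A sketch of that suggestion, reusing x, m and bench() from the benchmark script earlier in the thread: run many forward passes per measurement so each timed region is long relative to the timer resolution.

```python
iters = 200  # chosen so one measurement takes on the order of a second

def run_many():
    for _ in range(iters):
        m(x)

t = bench(run_many)  # bench() already synchronizes before and after
print(f"avg per forward pass (s) = {t / iters}")
```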
Another data point:
RTX 3090, PyTorch release/1.7, CUDA 11.1 Update 1
Conv2D 16 -> 32 -> 64: 0.05526