Deeplabcut: Cannot get RTX 3090 card to start training

Created on 5 Oct 2020 · 9 Comments · Source: DeepLabCut/DeepLabCut

OS: Win 10
DeepLabCut Version: 2.2b8
Anaconda env used: DLC-GPU (cloned from Alex's github)
WxPython version: 4.0.7.post2
Tensorflow version: many, installed with pip (see below)
Cuda version: 10 and 11

Hi everyone,

First of all, I wanted to thank all the authors for this amazing software!

I'm starting to work with DeepLabCut, and after a few promising preliminary results with an "old" GPU (Turing architecture), we decided to upgrade to the recent Ampere architecture. Since it is also backwards compatible with old CUDA versions, we thought it would be fine. However, after trying many combinations of TensorFlow and CUDA, I cannot make it work. Here are the combinations I have tried so far:

CUDA | TensorFlow | cuDNN | Works?
10 | 1.15.2 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
10 | 1.15.0 | 7.6.5 | Same as with TF 1.15.2
10 | 1.14.0 | 7.6.5 | Does not detect the GPU
11 | 1.15.0 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
11 | 1.13.1 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
11 | 1.14.0 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
11 | 1.15.4 | 7.6.5 | Does not detect the GPU
11 | 1.15.2 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
* TensorFlow 1.13.1 does not detect the GPU either.

With the combinations above that recognize the GPU and can print "Hello, Tensorflow", I end up stuck at this screen (see code below).

I know the documentation says that CUDA 10.+ is not supported, but with our old card it was running fine with CUDA 11. I have very limited knowledge about this, so I'm not sure why or how it worked.

According to the CUDA documentation, the Ampere architecture is compatible with CUDA 10.2 or earlier. Also, according to the TensorFlow documentation, TensorFlow 1.15 should be compatible with Ampere. The only caveat is that it takes too long to start (up to 30 min), but that can be fixed by increasing the CUDA cache size.
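For readers wondering what "increasing the CUDA cache size" looks like in practice, here is a minimal sketch. It assumes the cache in question is the driver's JIT compilation cache, which is controlled by the `CUDA_CACHE_MAXSIZE` environment variable (a size in bytes). The helper name `set_cuda_cache_size` and the 2 GiB value are illustrative choices, not something taken from this thread. The variable has to be set before TensorFlow initializes CUDA, so it is set before any TensorFlow import.

```python
import os


def set_cuda_cache_size(size_bytes: int = 2 * 1024**3) -> str:
    """Set the CUDA JIT cache size (in bytes) for this process.

    The 2 GiB default is an illustrative value, not a recommendation
    from the thread. Returns the value that was written.
    """
    os.environ["CUDA_CACHE_MAXSIZE"] = str(size_bytes)
    return os.environ["CUDA_CACHE_MAXSIZE"]


# Set the variable first, then import TensorFlow / DeepLabCut afterwards.
print(set_cuda_cache_size())
```

The same variable can also be set system-wide in the Windows environment variable settings, so that it applies to the GUI as well as to scripts.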

So, to me, the only thing left that could be causing issues is cuDNN. According to Nvidia, support for Ampere only appeared in cuDNN 8. However, as far as I know, Anaconda only supports up to cuDNN 7.6.5 on Windows. Apparently it has reached cuDNN 8 on Linux.

Code output

[Selecting multi-animal trainer
Config:
{'all_joints': [[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9],
[10],
[11],
[12]],
'all_joints_names': ['snout',
'cap',
'leftear',
'rightear',
'spine',
'lforepaw',
'rforepaw',
'lhindpaw',
'rhindpaw',
'tailbase',
'tailend',
'cornerofbox1',
'cornerofbox2'],
'batch_size': 8,
'crop_pad': 0,
'cropratio': 0.4,
'dataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\2CamTest9_CF95shuffle3.pickle',
'dataset_type': 'multi-animal-imgaug',
'deterministic': False,
'display_iters': 500,
'fg_fraction': 0.25,
'global_scale': 0.8,
'init_weights': 'C:\Users\RyC\anaconda3\envs\dlc-gpu\lib\site-packages\deeplabcut\pose_estimation_tensorflow\models\pretrained\resnet_v1_50.ckpt',
'intermediate_supervision': False,
'intermediate_supervision_layer': 12,
'location_refinement': True,
'locref_huber_loss': True,
'locref_loss_weight': 0.05,
'locref_stdev': 7.2801,
'log_dir': 'log',
'max_input_size': 1500,
'mean_pixel': [123.68, 116.779, 103.939],
'metadataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\Documentation_data-2CamTest9_95shuffle3.pickle',
'min_input_size': 64,
'mirror': False,
'multi_step': [[0.0001, 7500], [5e-05, 12000], [1e-05, 200000]],
'net_type': 'resnet_50',
'num_joints': 13,
'num_limbs': 55,
'optimizer': 'adam',
'pafwidth': 20,
'pairwise_huber_loss': False,
'pairwise_loss_weight': 0.1,
'pairwise_predict': False,
'partaffinityfield_graph': [[5, 9],
[4, 7],
[1, 3],
[6, 9],
[4, 8],
[5, 6],
[2, 8],
[0, 7],
[8, 9],
[1, 6],
[0, 10],
[3, 7],
[0, 3],
[2, 5],
[2, 4],
[5, 8],
[1, 2],
[4, 9],
[6, 7],
[2, 9],
[3, 10],
[6, 10],
[8, 10],
[1, 5],
[3, 6],
[0, 4],
[1, 10],
[7, 10],
[4, 10],
[2, 6],
[4, 5],
[1, 4],
[2, 10],
[9, 10],
[3, 9],
[0, 5],
[1, 9],
[2, 3],
[0, 8],
[3, 5],
[0, 1],
[2, 7],
[7, 9],
[7, 8],
[5, 10],
[4, 6],
[6, 8],
[5, 7],
[3, 8],
[0, 6],
[1, 8],
[1, 7],
[0, 9],
[3, 4],
[0, 2]],
'partaffinityfield_predict': True,
'pos_dist_thresh': 17,
'project_path': 'C:\Users\RyC\2CamTest9-CF-2020-10-04',
'regularize': False,
'rotation': 25,
'rotratio': 0.4,
'save_iters': 10000,
'scale_jitter_lo': 0.5,
'scale_jitter_up': 1.25,
'scoremap_dir': 'test',
'shuffle': True,
'snapshot_prefix': 'C:\Users\RyC\2CamTest9-CF-2020-10-04\dlc-models\iteration-0\2CamTest9Oct4-trainset95shuffle3\train\snapshot',
'stride': 8.0,
'weigh_negatives': False,
'weigh_only_present_joints': False,
'weigh_part_predictions': False,
'weight_decay': 0.0001}
Activating limb prediction...
Starting with multi-animal imaug + adam pose-dataset loader.
Batch Size is 8
Getting specs multi-animal-imgaug 55 13
Initializing ResNet
Loading ImageNet-pretrained resnet_50
2020-10-05 10:40:16.943131: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-10-05 10:40:16.946595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:08:00.0
2020-10-05 10:40:16.946675: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-10-05 10:40:16.948226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-10-05 10:40:16.948570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-10-05 10:40:16.948928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-10-05 10:40:16.949263: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-10-05 10:40:16.949302: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-10-05 10:40:16.949559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-10-05 10:40:16.949840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-10-05 10:40:17.963045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-05 10:40:17.963140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2020-10-05 10:40:17.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2020-10-05 10:40:17.964440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22071 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:08:00.0, compute capability: 8.6)
Max_iters overwritten as 3000
Display_iters overwritten as 10
Save_iters overwritten as 50
Training parameters:
{'stride': 8.0, 'weigh_part_predictions': False, 'weigh_negatives': False, 'fg_fraction': 0.25, 'mean_pixel': [123.68, 116.779, 103.939], 'shuffle': True, 'snapshot_prefix': 'C:\Users\RyC\2CamTest9-CF-2020-10-04\dlc-models\iteration-0\2CamTest9Oct4-trainset95shuffle3\train\snapshot', 'log_dir': 'log', 'global_scale': 0.8, 'location_refinement': True, 'locref_stdev': 7.2801, 'locref_loss_weight': 0.05, 'locref_huber_loss': True, 'optimizer': 'adam', 'intermediate_supervision': False, 'intermediate_supervision_layer': 12, 'regularize': False, 'weight_decay': 0.0001, 'crop_pad': 0, 'scoremap_dir': 'test', 'batch_size': 8, 'dataset_type': 'multi-animal-imgaug', 'deterministic': False, 'mirror': False, 'pairwise_huber_loss': False, 'weigh_only_present_joints': False, 'partaffinityfield_predict': True, 'pairwise_predict': True, 'all_joints': [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]], 'all_joints_names': ['snout', 'cap', 'leftear', 'rightear', 'spine', 'lforepaw', 'rforepaw', 'lhindpaw', 'rhindpaw', 'tailbase', 'tailend', 'cornerofbox1', 'cornerofbox2'], 'cropratio': 0.4, 'dataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\2CamTest9_CF95shuffle3.pickle', 'display_iters': 500, 'init_weights': 'C:\Users\RyC\anaconda3\envs\dlc-gpu\lib\site-packages\deeplabcut\pose_estimation_tensorflow\models\pretrained\resnet_v1_50.ckpt', 'max_input_size': 1500, 'metadataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\Documentation_data-2CamTest9_95shuffle3.pickle', 'min_input_size': 64, 'multi_step': [[0.0001, 7500], [5e-05, 12000], [1e-05, 200000]], 'net_type': 'resnet_50', 'num_joints': 13, 'num_limbs': 55, 'pafwidth': 20, 'pairwise_loss_weight': 0.1, 'partaffinityfield_graph': [[5, 9], [4, 7], [1, 3], [6, 9], [4, 8], [5, 6], [2, 8], [0, 7], [8, 9], [1, 6], [0, 10], [3, 7], [0, 3], [2, 5], [2, 4], [5, 8], [1, 2], [4, 9], [6, 7], [2, 9], [3, 10], [6, 10], [8, 10], [1, 5], [3, 6], [0, 4], [1, 10], [7, 10], [4, 10], [2, 
6], [4, 5], [1, 4], [2, 10], [9, 10], [3, 9], [0, 5], [1, 9], [2, 3], [0, 8], [3, 5], [0, 1], [2, 7], [7, 9], [7, 8], [5, 10], [4, 6], [6, 8], [5, 7], [3, 8], [0, 6], [1, 8], [1, 7], [0, 9], [3, 4], [0, 2]], 'pos_dist_thresh': 17, 'project_path': 'C:\Users\RyC\2CamTest9-CF-2020-10-04', 'rotation': 25, 'rotratio': 0.4, 'save_iters': 10000, 'scale_jitter_lo': 0.5, 'scale_jitter_up': 1.25}
Starting multi-animal training....
2020-10-05 10:40:27.731872: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll]

After reading in some forums that people have been successful using symlinks in other applications, I tried that with cudnn64_7.dll, hard-linked to cudnn64_8.dll inside the DLC-GPU environment, but I have not been able to make it work. It shows an error saying that the compute capabilities do not match.

Do you have any suggestions that I might try?

Many thanks in advance.

WORK IN PROGRESS! · backwards compatibility · tensorflow · training

cfernandezpa

All 9 comments

We just got a 3090 in the lab this week, so we can test it. In general, though, I would suggest always running our test scripts as a first pass after installation.

https://www.youtube.com/watch?v=IOWtKn3l33s

https://github.com/DeepLabCut/DeepLabCut/tree/master/examples

MMathisLab on 11 Oct 2020

Thank you very much for the reply. That is great news; hopefully you will be able to make it work! I won't be able to try the test scripts this week, but I'll do it the following week for sure and report back.

Thanks again!

cfernandezpa on 12 Oct 2020

Sorry, we haven't gotten to this yet, but you might try our dev branch with TF2.x: https://github.com/DeepLabCut/DeepLabCut-core/tree/tf2.2alpha

MMathisLab on 16 Oct 2020

Hi,

I have tried the test script with version 2.2b8, and it stops at the same point as when I tried with my own data set.
I am now going to try the dev branch and see how it goes.

Thanks!

cfernandezpa on 16 Oct 2020

Hi,

I am currently trying the dev branch and I was able to start training. However, it was far from ideal. First, I learned that TF2.2 does not work with CUDA 11 (let alone 11.1), so it won't recognize the GPU. I therefore had to install CUDA 10.1, which is supposed to be the version that works with TF2.2. That change made the system recognize the GPU. Training then took a long time to start, but it engaged the GPU, as seen in Task Manager. A warning message was shown about PTX compilation being done by the driver (I cannot find the original message in the training log), after which training started, but it was very slow. Also, the reduction of the "loss" value after each iteration seems smaller than I remembered, but I have no objective way to confirm this. In any case, I was able to train for 10000 iterations, which is good progress.
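One plausible reading of the PTX warning described above, sketched as code. The assumption is that a CUDA 10.x toolkit cannot emit native machine code for compute capability 8.x, so the driver has to JIT-compile PTX at startup, which would fit the slow start. The table lists the newest architecture each toolkit release can target natively; the function name `needs_ptx_jit` is hypothetical, and whether a given TensorFlow binary actually ships code for those architectures is a separate question this sketch does not answer.

```python
# Newest compute capability each CUDA toolkit release can compile native
# code for. Illustrative table for the versions mentioned in this thread.
MAX_NATIVE_CC = {
    "10.0": (7, 5),
    "10.1": (7, 5),
    "10.2": (7, 5),
    "11.0": (8, 0),
    "11.1": (8, 6),
}


def needs_ptx_jit(gpu_cc, cuda_version):
    """Return True if a GPU with compute capability `gpu_cc` (major, minor)
    can only run via driver-side PTX compilation under `cuda_version`.

    Native binaries stay usable on newer minor revisions of the same major
    architecture, so only the major number is compared.
    """
    max_major, _ = MAX_NATIVE_CC[cuda_version]
    return gpu_cc[0] > max_major


print(needs_ptx_jit((8, 6), "10.1"))  # RTX 3090 with CUDA 10.1
print(needs_ptx_jit((8, 6), "11.1"))  # RTX 3090 with CUDA 11.1
```

The compute capability of the card (8.6 for the RTX 3090) is the value TensorFlow prints in the "Created TensorFlow device" log line.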

I think one possible solution is to compile TF2.2 or 2.3 with CUDA 11.1 from source, but I don't know how to do that on Windows. I found an article on how to do it for Linux (https://towardsdatascience.com/how-to-compile-tensorflow-2-3-with-cuda-11-1-8cbecffcb8d3). Could you please advise on this matter?

If I find anything else, I'll post it here.

Thanks!

cfernandezpa on 17 Oct 2020

Little update. I noticed that I had "gputouse=0", so I changed it to 1, and training started much faster and is running about 100X faster.
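For context, `gputouse` is a keyword argument of `deeplabcut.train_network`. As far as I know, DeepLabCut applies it by setting the `CUDA_VISIBLE_DEVICES` environment variable; the helper `select_gpu` below mimics that and is a hypothetical name, as is the `train` wrapper. Which index corresponds to which physical card depends on how the devices are enumerated on the machine (for example in `nvidia-smi`), so the value 1 is specific to the poster's setup.

```python
import os
from typing import Optional


def select_gpu(gputouse: Optional[int]) -> Optional[str]:
    """Restrict CUDA to one device index; leave the environment alone if None."""
    if gputouse is None:
        return os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gputouse)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def train(config_path: str, gputouse: Optional[int] = None, **kwargs) -> None:
    """Hypothetical wrapper: pick the GPU, then start DeepLabCut training."""
    select_gpu(gputouse)
    import deeplabcut  # imported lazily so the helper works without DLC installed

    deeplabcut.train_network(config_path, gputouse=gputouse, **kwargs)


# Only the environment side is exercised here; train() needs a real project.
print(select_gpu(1))
```

A call would then look like `train(r"path\to\config.yaml", gputouse=1)`, where the config path is a placeholder for the project's own config file.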

I'll keep you posted with any advances I make.

cfernandezpa on 18 Oct 2020

👍2

Hi,

I noticed you closed this issue, which is fair since I was able to train using DeepLabCutCore. However, I'm not sure about the validity of the training results, as I'm unable to evaluate them with either this version or the GUI of version 2.2b8; a KeyError is raised after evaluation starts. Also, the available options in the Core version are limited, as you know.

So, my question is: should another issue be opened to tackle DLC compatibility with RTX 3000 series cards? I'm willing to help as far as my skills allow.

Thanks!

cfernandezpa on 19 Oct 2020

It's a good point; I'll reopen until it's really resolved. For now, people can also hopefully find the TF2.x branch!

"However, I'm not sure about the validity of the results of the training as I'm unable to evaluate it with either this version or using the GUI with version 2.2b8;"

Correct - the branch is only up to date with 2.1.8.1! :) So when we roll up to 2.2x for TF, that would work again.

MMathisLab on 20 Oct 2020

Hi,

I have been testing some more and I have made some progress. I can confirm that the training works well with the following system settings:

Deeplabcutcore
CUDA 11.1
Cudnn 8.0.4.30
Drivers 456.71
Tensorflow tf-nightly-gpu 2.5.0.dev20201019
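To compare another machine against the settings listed above, a small check like the following may help. It is a sketch that assumes a TensorFlow 2.x install: `tf.config.list_physical_devices` and `tf.test.is_built_with_cuda` are available there, while `tf.sysconfig.get_build_info` only exists in newer releases, so it is looked up defensively. The function name `describe_tf_gpu_setup` is hypothetical.

```python
def describe_tf_gpu_setup():
    """Collect the TensorFlow version, its CUDA/cuDNN build info and visible GPUs.

    Returns {"tensorflow": None} if TensorFlow cannot be imported.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow": None}

    info = {
        "tensorflow": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }

    # Build info (CUDA / cuDNN versions) is only exposed by newer releases.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is not None:
        build = get_build_info()
        info["cuda_version"] = build.get("cuda_version")
        info["cudnn_version"] = build.get("cudnn_version")
    return info


print(describe_tf_gpu_setup())
```

An empty `gpus` list would correspond to the "does not detect the GPU" cases reported earlier in the thread.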

I had DeepLabCut and TF installed in a Python environment (not Anaconda), and I was able to train, evaluate, analyze, and create a video. I encountered an issue where the video analysis was running very slowly, which makes me think that the GPU was not fully engaged in this part.

Hopefully the full version, including the GUI, will be available soon.

Thanks!

cfernandezpa on 31 Oct 2020

🎉1
