Pytorch: Version 1.3 no longer supporting Tesla K40m?

Created on 27 Nov 2019 · 61 comments · Source: pytorch/pytorch

🐛 Bug

I am using a Tesla K40m and installed PyTorch 1.3 with conda, using CUDA 10.1.

To Reproduce

Steps to reproduce the behavior:

Have a box with a Tesla K40m
conda install pytorch cudatoolkit -c pytorch
show cuda is available

python -c 'import torch; print(torch.cuda.is_available());'
>>> True

Instantiate a model and call .forward()

Traceback (most recent call last):
  File "./baselines/get_results.py", line 395, in <module>
    main(args)
  File "./baselines/get_results.py", line 325, in main
    log_info = eval_main(eval_args)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/baselines/eval_task.py", line 165, in main
    log_info = trainer.test(0, evaluate=True)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_trainers.py", line 110, in test
    evaluate=evaluate)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_trainers.py", line 220, in iteration
    model_output = self.model.forward(input_data, input_lengths)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_models.py", line 49, in forward
    self.hidden = self.init_hidden(batch_size, device=device)
  File "/mnt/cdtds_cluster_home/s0816700/git/midi_degradation_toolkit/mdtk/pytorch_models.py", line 40, in init_hidden
    return (torch.randn(1, batch_size, self.hidden_dim, device=device),
RuntimeError: CUDA error: no kernel image is available for execution on the device

I first tried downgrading to cudatoolkit=10.0, which exhibited the same issue.

The code runs fine if you repeat the steps above but instead run conda install pytorch=1.2 cudatoolkit=10.0 -c pytorch.

Expected behavior

If a specific GPU is no longer supported, please fail at load time with a useful error message.
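For context on what such a check would have to decide, here is a minimal sketch of the CUDA fat-binary compatibility rule, assuming the usual NVIDIA semantics: a cubin compiled for `sm_XY` runs only on GPUs with the same major version X and a minor version of at least Y, while PTX compiled for `compute_XY` can be JIT-compiled by the driver for any GPU of capability X.Y or newer. `parse_arch` and `kernel_available` are hypothetical names, not a PyTorch API; the parser does not handle suffixed targets such as `sm_90a`, and the arch lists below are illustrative rather than the exact lists of any release.

```python
def parse_arch(arch):
    """Split an arch string such as 'sm_37' or 'compute_70' into (kind, major, minor)."""
    kind, _, digits = arch.partition("_")
    # The last digit is the minor version; everything before it is the major version.
    return kind, int(digits[:-1]), int(digits[-1])


def kernel_available(capability, arch_list):
    """Return True if a binary built for `arch_list` has code that can run on `capability`."""
    major, minor = capability
    for arch in arch_list:
        kind, a_major, a_minor = parse_arch(arch)
        # A cubin is only compatible within the same major version, and only forward:
        # an sm_37 cubin does not run on a 3.5 device.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX can be JIT-compiled by the driver for the same or any newer device.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


# A 3.5 device (Tesla K40m) against an arch list that starts at sm_37:
print(kernel_available((3, 5), ["sm_37", "sm_50", "sm_60", "sm_70"]))  # False
print(kernel_available((3, 7), ["sm_37", "sm_50", "sm_60", "sm_70"]))  # True
```

On a real install you could feed it `torch.cuda.get_device_capability(0)` together with `torch.cuda.get_arch_list()`, which later PyTorch releases provide. Querying the capability does not launch a kernel, so a check like this could fail early with a readable message instead of "no kernel image is available".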

Environment

Unfortunately, I ran your script after I 'fixed' the problem, so the PyTorch version shown here is 1.2; the issue was encountered with version 1.3.

Collecting environment information...
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Scientific Linux release 7.6 (Nitrogen)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: version 2.8.12.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K40m
Nvidia driver version: 430.50
cuDNN version: /usr/lib64/libcudnn.so.6.5.18

Versions of relevant libraries:
[pip3] numpy==1.16.3
[pip3] numpydoc==0.8.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.4                      243  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.0.15           py37ha843d7b_0  
[conda] mkl_random                1.1.0            py37hd6b4f25_0  
[conda] pytorch                   1.2.0           py3.7_cuda10.0.130_cudnn7.6.2_0    pytorch
[conda] torchvision               0.4.0                py37_cu100    pytorch

cc @ezyang @gchanan @zou3519 @jerryzh168 @ngimel

Labels: binaries, cuda, docs, triaged

JamesOwers

Most helpful comment

I'd just like to suggest that the compatible compute capabilities for the precompiled binaries be added somewhere to the documentation, especially when providing installation instructions for the binaries. That information does not appear to be readily available anywhere.

jeherr on 16 Dec 2019

All 61 comments

Just to be sure, were you using 1.3.0 or 1.3.1?

albanD on 27 Nov 2019

1.3.1

conda list 'pytorch|cuda'
>>> # packages in environment at /home/s0816700/miniconda3/envs/mdtk:
>>> #
>>> # Name                    Version                   Build  Channel
>>> cudatoolkit               10.1.243             h6bb024c_0  
>>> pytorch                   1.3.1           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch

That was the env at the point of failure.

JamesOwers on 27 Nov 2019

cc @ngimel

albanD on 27 Nov 2019

K40m has a compute capability of 3.5, which I believe we have dropped support for.

SsnL on 27 Nov 2019

OK. Could you please implement a useful "oldgpu" warning, like the one here? https://github.com/pytorch/pytorch/issues/6529

At the moment the error is very unclear to a casual user like me.

--- EDIT ---
It would also be great to link users:

to a page detailing which compute capabilities you support (if this exists), and
to instructions on how to find out the compute capability of your GPU (I guess here: https://developer.nvidia.com/cuda-gpus#compute for most?)

Struggl(ed/ing) to find both of those things!
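As a rough sketch of the second point, assuming a working PyTorch install: `torch.cuda.get_device_capability()` and `torch.cuda.get_device_name()` report the capability without launching any kernel, so they work even on a GPU for which the binary has no kernels. `describe_gpus` is a hypothetical helper; it takes the `torch` module as a parameter only so that it can be exercised without a GPU.

```python
def describe_gpus(torch_module):
    """Return one 'index: name (compute capability X.Y)' line per visible CUDA device."""
    lines = []
    if not torch_module.cuda.is_available():
        return lines
    for index in range(torch_module.cuda.device_count()):
        # Reading device properties does not require a compiled kernel for this GPU.
        major, minor = torch_module.cuda.get_device_capability(index)
        name = torch_module.cuda.get_device_name(index)
        lines.append(f"{index}: {name} (compute capability {major}.{minor})")
    return lines


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        for line in describe_gpus(torch) or ["no CUDA device visible"]:
            print(line)
```

The result can then be checked against NVIDIA's table linked above.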

As an aside, @SsnL - possibly this line needs updating if you are correct:
https://github.com/pytorch/pytorch/blob/bf61405ed61816b23c57718722145f26f217666a/torch/__init__.py#L10. Where did you get your information about minimal compute capability support?

JamesOwers on 27 Nov 2019

@JamesOwers If I'm not mistaken, this commit bumped the minimal compute capability to 3.7.

ptrblck on 30 Nov 2019

There's no technical reason for it to be changed to 3.7, right?
The code still supports 3.5 (and even 3.0 again).

Is this just for conda? It looks like it went from 3.5 and 5.0+ to 3.7 and 5.0+, so it was always missing either 3.5 or 3.7. I suppose it takes too long, or the binaries become too large, to support more than two built architectures.

xsacha on 30 Nov 2019

@soumith might correct me, but I think the main reason is the growing size of the binaries.

ptrblck on 1 Dec 2019

@ptrblck that is the reason but it is strange it went from supporting K40 (+ several consumer cards) and not K80 to supporting K80 and not K40 (+ several consumer cards).

> on an NVIDIA GPU with compute capability >= 3.0.

I also wish there was a way for the message to reflect the minimum CUDA arch from the arch list the binary was compiled with. That would make it easier when it gets changed to 3.7, for example, or when a user supports 3.0 by compiling it themselves.
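One way to do what is described here, as a sketch: derive the minimum from the compiled arch list instead of a hard-coded 3.0. It assumes the arch list is available as strings like `sm_37` or `compute_70` (later releases expose such a list via `torch.cuda.get_arch_list()`). `minimum_capability`, `old_gpu_message` and the message wording are hypothetical, not PyTorch's actual warning code. Comparing against the minimum is deliberately coarse: a device above the minimum can still lack a matching kernel if no entry covers its major version and no PTX is embedded.

```python
def minimum_capability(arch_list):
    """Lowest (major, minor) among entries like 'sm_37' or 'compute_70'."""
    versions = []
    for arch in arch_list:
        _, _, digits = arch.partition("_")
        # Last digit is the minor version, the rest is the major version.
        versions.append((int(digits[:-1]), int(digits[-1])))
    return min(versions)


def old_gpu_message(device_name, capability, arch_list):
    """Return a warning if the device is below the build's minimum, else None."""
    low = minimum_capability(arch_list)
    if tuple(capability) >= low:
        return None
    return (
        f"Found GPU {device_name} with compute capability {capability[0]}.{capability[1]}, "
        f"but this PyTorch binary was built for compute capability >= {low[0]}.{low[1]}. "
        "Consider building from source with TORCH_CUDA_ARCH_LIST set for your GPU."
    )


print(old_gpu_message("Tesla K40m", (3, 5), ["sm_37", "sm_50", "sm_60", "sm_70"]))
```

Because the minimum is read from the build itself, a user who compiles with 3.0 included would automatically get a message that reflects 3.0.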

xsacha on 2 Dec 2019

This is also being discussed at https://github.com/pytorch/pytorch/issues/24205#issuecomment-560185215

ezyang on 3 Dec 2019

K40m with CUDA 10.0 gets the same error!!!
Building from source gives more errors!!!

shiyongde on 4 Mar 2020

hi guys, i have made a python 3.6 pytorch 1.3.1 linux_x86_64 wheel without restriction on compute capability, and it's working on my 3.5 GPU. i would be more than happy to build wheels for different python and pytorch versions if someone can tell me a proper distribution channel (i.e. not google drive).

jayenashar on 1 May 2020

@jayenashar Are you able to provide instructions for building PyTorch 1.3.1 for a specific GPU (NVIDIA Tesla K20) and Python 3.6.8? I've attempted to build a compatible version but am still having hardware compatibility issues:
[W NNPACK.cpp:77] Could not initialize NNPACK! Reason: Unsupported hardware.

aln3 on 15 May 2020

@anowlan123 I don't see a reason to build for a specific GPU, but I believe you can export the environment variable TORCH_CUDA_ARCH_LIST for your specific compute capability (3.5), then use the build-from-source instructions for pytorch.
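A sketch of that approach in Python, assuming a PyTorch source checkout whose `setup.py` honours `TORCH_CUDA_ARCH_LIST` (entries separated by spaces, with an optional `+PTX` suffix). `arch_list_env` and `build_from_source` are hypothetical helpers; the checkout directory is up to you, and this does not replace the rest of the build-from-source instructions (dependencies, submodules, checking out the right tag).

```python
import os
import subprocess
import sys


def arch_list_env(capabilities, ptx_for_highest=True):
    """Format capabilities such as [(3, 5), (5, 0)] as a TORCH_CUDA_ARCH_LIST value."""
    entries = [f"{major}.{minor}" for major, minor in sorted(set(capabilities))]
    if ptx_for_highest and entries:
        # "+PTX" also embeds PTX for the highest arch, so newer GPUs can JIT-compile it.
        entries[-1] += "+PTX"
    return " ".join(entries)


def build_from_source(source_dir, capabilities):
    """Run `python setup.py install` in a PyTorch checkout, restricted to the given archs."""
    env = dict(os.environ, TORCH_CUDA_ARCH_LIST=arch_list_env(capabilities))
    subprocess.run([sys.executable, "setup.py", "install"], cwd=source_dir, env=env, check=True)


# A Tesla K20 or K40 has compute capability 3.5:
print(arch_list_env([(3, 5)]))  # 3.5+PTX
```

Restricting the list to the GPUs you actually have also keeps the build shorter and the binary smaller.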

The pytorch 1.3.1 wheel I made should work for you (python 3.6.9, NVIDIA Tesla K20 GPU). I setup a pypi account to try and distribute it, but it seems there is a 60MB limit, and my wheel is 139MB. So I have uploaded it here: https://github.com/UNSWComputing/pytorch/releases/download/v1.3.1/torch-1.3.1-cp36-cp36m-linux_x86_64.whl

jayenashar on 15 May 2020

Dear @anowlan123,

I would be very interested in a wheel of PyTorch 1.4 that works with a Kepler K40 and CUDA 9.2. Would you be able to help out? I am thinking about installing it via miniconda.

PeteKey on 18 May 2020

@PeteKey you didn't specify a python version, but i made a wheel with python 3.6, pytorch 1.4.1, and magma-cuda92. please try it here: https://github.com/UNSWComputing/pytorch/releases/download/v1.4.1/torch-1.4.1-cp36-cp36m-linux_x86_64.whl
If you have any issues, please upgrade to cuda10.2.

jayenashar on 18 May 2020

@anowlan123, python 3.6 is fine but I guess something is not quite working yet. Any idea how to fix this? I am running this on ubuntu 14.04 if that matters.

File "/home/pk/miniconda3/envs/pytorch1.4py36_unsw_anowlan123/lib/python3.6/site-packages/torch/__init__.py", line 81, in <module>
from torch._C import *
ImportError: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory

PeteKey on 18 May 2020

@jayenashar Thanks, but I'm still having the same compatibility issues. @PeteKey once I created a conda environment, I used this script to build from source.

#!/bin/bash

# Make sure the conda environment is activated
cd /home/user
conda activate env

conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi
conda install -c pytorch magma-cuda101

cd ~/anaconda3/envs/env/compiler_compat
mv ld ld-old

# Prep PyTorch repo
cd /home/user/Downloads
git clone --recursive https://github.com/pytorch/pytorch
cd /home/user/Downloads/pytorch
git submodule sync
git submodule update --init --recursive

# Specify environment variables for the specific PyTorch build
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
export PYTORCH_BUILD_VERSION=1.3.1
export PYTORCH_BUILD_NUMBER=1
export TORCH_CUDA_ARCH_LIST=3.5

# Build and install
python setup.py install

# Clean the build
python setup.py clean --all

cd ~/anaconda3/envs/env/compiler_compat
mv ld-old ld

aln3 on 18 May 2020

> python 3.6 is fine but I guess something is not quite working yet. Any idea how to fix this? I am running this on ubuntu 14.04 if that matters.
>
> File "/home/pk/miniconda3/envs/pytorch1.4py36_unsw_anowlan123/lib/python3.6/site-packages/torch/__init__.py", line 81, in <module>
> from torch._C import *
> ImportError: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory

@PeteKey i think it's because you are using the wheel in conda. try installing mkl in your conda env.

also, upgrade your ubuntu

jayenashar on 18 May 2020

@anowlan123 i don't see where you checkout v1.3.1 in the git repo. also i'm not sure where your install installs to. here is the script i used, it only makes the wheel, then i used pip to install it.

#!/bin/bash
CONDA_ENV=py369
PYTHON_VERSION=3.6.9
CUDA_VERSION=102
PYTORCH_BUILD_VERSION=1.3.1

set -e
set -u

conda create --yes --name $CONDA_ENV python=$PYTHON_VERSION
conda activate $CONDA_ENV
conda install --yes numpy ninja pyyaml mkl mkl-include setuptools cmake cffi
conda install --yes --channel pytorch magma-cuda$CUDA_VERSION

cd ~/pytorch
git checkout v$PYTORCH_BUILD_VERSION
git submodule sync
git submodule update --init --recursive
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
export PYTORCH_BUILD_VERSION
PYTORCH_BUILD_NUMBER=0 python setup.py bdist_wheel

jayenashar on 18 May 2020

@PeteKey i just uploaded files for conda (py3.5 and py3.6) to https://github.com/UNSWComputing/pytorch/releases/tag/v1.4.1 . i think you can download the files and install them with conda install $filename but i'm not sure how to tell conda to install the dependencies. maybe with --only-deps? here are the dependencies of the py3.6 tarball if you need them:

    "blas * mkl", 
    "cudatoolkit >=9.2,<9.3", 
    "mkl >=2018", 
    "ninja", 
    "numpy >=1.11", 
    "python >=3.6,<3.7.0a0"
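These dependencies do not have to be copied by hand: a conda `.tar.bz2` package carries its metadata in `info/index.json`, whose `depends` field holds a list like the one above. A minimal sketch, assuming the legacy `.tar.bz2` format (not the newer `.conda` format) and that conda accepts the space-separated match specs stored there; `package_depends` and `install_commands` are hypothetical helpers that only build the commands, they do not run conda.

```python
import json
import shlex
import tarfile


def package_depends(tarball_path):
    """Return the 'depends' list from info/index.json inside a conda .tar.bz2 package."""
    with tarfile.open(tarball_path, "r:bz2") as archive:
        member = archive.extractfile("info/index.json")
        return json.load(member)["depends"]


def install_commands(tarball_path):
    """Build two shell commands: install the dependencies, then the local package itself."""
    specs = " ".join(shlex.quote(dep) for dep in package_depends(tarball_path))
    return [
        f"conda install --yes {specs}",
        # Installing a local tarball generally skips dependency resolution, hence step one.
        f"conda install --yes {shlex.quote(tarball_path)}",
    ]
```

Running the first command and then the second should pull the dependencies from your configured channels before installing the downloaded tarball.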

jayenashar on 18 May 2020

I just got it compiled once and it worked with the K40 (so the proof of concept is done), but then I did something that messed it up, so I am starting from scratch.

What would be the best steps to install torchvision for CUDA 10.0 and cudatoolkit 10.0, and then compile PyTorch 1.4? I suppose the question is in which order I should do things so as not to end up with a mess, as miniconda likes to 'resolve' lots of things, and my feeling is that this causes a mess later on, as it wants to download PyTorch 1.4 from the channel, etc.

PeteKey on 18 May 2020

maybe the best thing is to install the upstream pytorch 1.4.1 for cuda 10.0 and then install the package i can make for you. this is how i am making the conda tarballs with pytorch/builder:

export PYTORCH_REPO=pytorch
export PYTORCH_BRANCH=v1.4.1
export PYTORCH_BUILD_VERSION=1.4.1
export PYTORCH_BUILD_NUMBER=0
export TORCH_CONDA_BUILD_FOLDER=pytorch-nightly
export TORCH_PACKAGE_NAME=torch
export PIP_UPLOAD_FOLDER=""
export NIGHTLIES_ROOT_FOLDER="$HOME/local/builder/binaries_v1.4.1"
cd pytorch-builder/cron
./build_multiple.sh conda 3.6 cu92

jayenashar on 19 May 2020

@jayenashar, I am very close but one thing that is still weird is that even when I export CUDA_VERSION=100, the torch.version.cuda still shows cuda 10.2 ... which is a bit too new for some codes I am trying to run.

If you can tell me how to force the compilation to use CUDA 10 (and then end up with PyTorch built with CUDA 10), that will solve my aches. Below I list my packages (I can see cuda100 and cudatoolkit 10.0 there), so I am confused.

PeteKey on 19 May 2020

> @jayenashar, I am very close but one thing that is still weird is that even when I export CUDA_VERSION=100, the torch.version.cuda still shows cuda 10.2 ... which is a bit too new for some codes I am trying to run.

@PeteKey I'm not 100% sure the CUDA_VERSION in my script actually works. It may only set the package version numbers and not set what version of CUDA it builds with. if you run nvcc --version on the machine that you are using to build from source, that probably needs to be 10.0.
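That check can be scripted. A minimal sketch, assuming the build picks up the same toolkit as the `nvcc` on your `PATH` (not guaranteed if `CUDA_HOME` points elsewhere); the helper names are hypothetical. After building, compare the result with `torch.version.cuda`, which is a string such as `"10.2"` (or `None` for a CPU-only build).

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, e.g. 'release 10.0, V10.0.130'."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def local_nvcc_release():
    """Return (major, minor) for the nvcc on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=True)
    return parse_nvcc_release(result.stdout)


def matches_torch_cuda(nvcc_release, torch_cuda_version):
    """Compare nvcc's (major, minor) with a string such as torch.version.cuda ('10.2')."""
    major, minor = torch_cuda_version.split(".")[:2]
    return nvcc_release == (int(major), int(minor))


print(parse_nvcc_release("Cuda compilation tools, release 10.0, V10.0.130"))  # (10, 0)
print(matches_torch_cuda((10, 0), "10.2"))  # False
```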

jayenashar on 19 May 2020

@jayenashar, @anowlan123, thanks for help. In the end I have compiled pytorch 1.5 for cuda10.2 and installed other packages via pip install, and finally the codes I wanted to run kicked off. So, my old K40 gets a bit more life.

PeteKey on 19 May 2020

@jayenashar, @anowlan123, one last question... bl...y PyTorch somehow got compiled without MAGMA... Do you know if I need to set any flags for that? I did have MAGMA installed...

PeteKey on 19 May 2020

@jayenashar, @anowlan123, solved, export MAGMA_HOME=/home/pk/miniconda3/pkgs/magma-cuda102-2.5.2-1/

PeteKey on 20 May 2020

Guys, thank you for providing the solutions to get torch running with K40! 👍🏻

I seem to have built torch from source fine, but importing the extension fails: from torch.utils.cpp_extension import xxx

My env:
K40, export *ARCH=3.5
Torch v1.5.0, build finished ok and I can import torch

Note: the source-built torch==1.5.0 did not replace the conda-installed pytorch==1.3.1, which I uninstalled along with its dependencies.

Do I need to specify anything in the torch source build to enable CUDA and cpp_extension?

Thank you

breznak on 29 May 2020

@breznak

There are some differences between conda and pip: the package names differ (pytorch vs torch), hence the old version was not replaced.

from torch.utils import cpp_extension works for me with my 1.3.1 pip package, so i'm not sure what is the issue with yours.

jayenashar on 29 May 2020

> from torch.utils import cpp_extension works for me with my 1.3.1 pip package, so i'm not sure what is the issue with yours.

It works for me too, but I need torch >= 1.4, and I need to build from source (K40). Those are the differences.

breznak on 31 May 2020

do any of the files i uploaded to https://github.com/UNSWComputing/pytorch/releases/tag/v1.4.1 work for you?

jayenashar on 31 May 2020

> do any of the files i uploaded to https://github.com/UNSWComputing/pytorch/releases/tag/v1.4.1 work for you?

thanks! I'll test but guessing from the names I don't think so, unfortunately.
My constraints are:

py>= 3.6
cuda >=10.1
torch>=1.4.1

breznak on 31 May 2020

I assume that since you tried PyTorch 1.5.0 first, that is your preferred version, so try this one: https://github.com/UNSWComputing/pytorch/releases/tag/v1.5.0
I also assumed the minimum Python and CUDA versions are what you already have installed, and built based on those assumptions.

jayenashar on 31 May 2020

@jayenashar thank you very much for supporting this!
I've tried your repo, but import torch is failing for me after install with

from torch._C import *
ImportError: /home/xxx/miniconda3/envs/detectron2-env/lib/python3.6/site-packages/torch/lib/libtorch_python.so: undefined symbol: PySlice_Unpack

EDIT: This looks like a Python issue. I was wrong about the default Python: Python 3.7 is the default in my conda env. I was able to install python==3.6 from conda-forge, but I'm not sure whether this error is related to using that Python.

breznak on 31 May 2020

Are you on Python 3.6.0? I believe there's an issue with that and you need any other 3.6 (i.e., 3.6.1- 3.6.10)

jayenashar on 31 May 2020

If you run python3.7, it shouldn't try to use this version of torch, so i would guess you have python 3.6.0
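The `PySlice_Unpack` error fits this diagnosis: the CPython C API documentation lists `PySlice_Unpack` as new in 3.6.1 (from then on, `PySlice_GetIndicesEx` is implemented on top of it in the headers), so an extension built against later 3.6 headers references a symbol that 3.6.0 does not export. A tiny guard, as a sketch; the function name is hypothetical.

```python
import sys


def cp36_extension_loadable(version_info=sys.version_info):
    """True if a cp36 binary compiled against 3.6.1+ headers can load in this interpreter."""
    major, minor, micro = tuple(version_info[:3])
    # PySlice_Unpack only exists from CPython 3.6.1 onwards.
    return (major, minor) == (3, 6) and micro >= 1


print(cp36_extension_loadable((3, 6, 0)))  # False
print(cp36_extension_loadable((3, 6, 9)))  # True
```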

jayenashar on 31 May 2020

> Are you on Python 3.6.0? I believe there's an issue with that

you're correct. (I had Python 3.7 installed as the default in my conda env; since I had mistakenly told you I have 3.6, I installed the Python 3.6.0 available on conda-forge.) I'll see if there's a later 3.6.x version available in conda.

breznak on 31 May 2020

Dear @jayenashar, can you please build PyTorch 1.5.0 for Python 3.7? I was trying to build from source, but ran into some unexpected errors. I'd really appreciate your effort!
My settings are:

Python 3.7
CUDA 10.1

Update: I setup a Python 3.6 environment and the package provided by @jayenashar (https://github.com/UNSWComputing/pytorch/releases/tag/v1.5.0) works well.
The latest torchvision (0.6.0) from conda also seems to work.

wydwww on 4 Jun 2020

No problem. For anyone making future requests, I would like the following info in your request:

pytorch version - patch level (1.3.0 should work with most gpus. 1.3.1 dropped support for a lot of our gpus.)
conda or pip
cuda version - minor level (9.2, 10.1, 10.2 - note that 10.2 supports old gpus. see https://developer.nvidia.com/cuda-gpus to find your GPU's compute capability and https://en.wikipedia.org/wiki/CUDA#GPUs_supported to find supported cuda versions)
python version - minor level (note that python 3.6.0 does not work, but other 3.6.x should work)

I do not need the exact GPU or compute capability. If your chosen version of pytorch supports it, it will be included.

I do not need the OS. I am only building linux packages.

@wydwww check https://github.com/UNSWComputing/pytorch/releases/tag/v1.5.0 for the latest upload. :)

jayenashar on 4 Jun 2020

Thank you @jayenashar , I really appreciate your work! Thanks to your prebuilt packages, I'm able to run pytorch (torchvision) on our cluster :+1:

> believe there's an issue with that and you need any other 3.6 (i.e., 3.6.1- 3.6.10)

everything works fine with py 3.7.

> so try this one: https://github.com/UNSWComputing/pytorch/releases/tag/v1.5.0

For reference, if you are installing from these binaries, you should use conda (not python or pip) to install these packages:
conda install pytorch-....tar.gz

breznak on 5 Jun 2020

@jayenashar Any luck building pytorch 1.6? I'm running into compilation errors at the moment, and am trying to debug it. I was able to build 1.5.1 so I'm not sure what's going on.

KevinMusgrave on 5 Aug 2020

sadly my [remote] machine for building is down, and no ETA on when i can get to it to bring it back up, sorry. haven't tried 1.6.

jayenashar on 5 Aug 2020

So the "bug" was pretty silly. I forgot to clean the build folder before trying to build 1.6.0. After I cleaned it, the build was successful.

Here's a whl file in case anyone needs it. It was built without tests or caffe2 operators: https://github.com/KevinMusgrave/pytorch/releases/tag/v1.6.0-compute-capability-3.5

KevinMusgrave on 6 Aug 2020

@jayenashar Just commenting here to thank you. Managed to make 1.5 work in a K40 :)

darolt on 20 Aug 2020

@jayenashar would it be possible for you to compile a version for a Tesla K40c with NVCC release 9.2, V9.2.148?

pytorch version -  closer to 1.6.0
conda or pip - any
cuda version - 10
python version - 3.8.5 or other

When I run nvcc --version it reports 9.2, but nvidia-smi outputs: Driver Version: 410.104 CUDA Version: 10.0
So I'm not sure which CUDA version I have. Any ideas why these versions don't match?

Edit: I managed to run PyTorch 1.4.1 with Torchvision 0.5.0 on a Tesla K40c with the following procedure:

Downloaded pytorch-1.4.1-py3.7_cuda9.2.148_cudnn7.6.3_0.tar.bz2 from @jayenashar's release at https://github.com/UNSWComputing/pytorch/releases/tag/v1.4.1
Installed TorchVision with conda install torchvision=0.5.0=py37_cu92
Overrode the PyTorch package with conda install pytorch-1.4.1-py3.7_cuda9.2.148_cudnn7.6.3_0.tar.bz2

Vichoko on 7 Oct 2020

@Vichoko https://github.com/UNSWComputing/pytorch/releases/tag/v1.6.0

you may have installed cuda-nvcc-9-2 and a precompiled driver. probably easier to upgrade nvcc.
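On the mismatch asked about above: `nvcc --version` reports the installed toolkit, while the "CUDA Version" in the `nvidia-smi` header is the newest CUDA version the driver supports, so the two need not match. For the CUDA 9/10-era versions discussed here, a toolkit works as long as it is not newer than the driver's figure (CUDA 11 and later relax this somewhat within a major version). A sketch with hypothetical helper names, using the header values quoted in the question:

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Extract the 'CUDA Version: X.Y' field from the nvidia-smi header, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_runs_on_driver(toolkit, driver_max):
    """Pre-CUDA-11 rule of thumb: the toolkit must not be newer than the driver's maximum."""
    return toolkit <= driver_max


header = "| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |"
driver_max = driver_cuda_version(header)
print(driver_max)                                   # (10, 0)
print(toolkit_runs_on_driver((9, 2), driver_max))   # True: nvcc 9.2 is within the driver's limit
print(toolkit_runs_on_driver((10, 2), driver_max))  # False
```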

jayenashar on 8 Oct 2020

@jayenashar Can you compile a version for Tesla K40c with the following:

PyTorch: latest possible
Python: 3.7.7
CUDA: 10.2
conda

Many thanks!

sophiaas on 11 Oct 2020

sorry @sophiaas it looks like i can't do CUDA 10.2 so easily. @KevinMusgrave has 10.1 at https://github.com/KevinMusgrave/pytorch/releases/tag/v1.6.0-compute-capability-3.5 in a wheel and i'm trying to build a conda version now.

jayenashar on 12 Oct 2020

@sophiaas i uploaded a 10.1 conda package to https://github.com/UNSWComputing/pytorch/releases/tag/v1.6.0

i'll keep trying to make a 10.2 but it keeps giving me a CPU-only build.

jayenashar on 12 Oct 2020

@jayenashar Thank you! Really appreciate it.

sophiaas on 12 Oct 2020

If it helps, I put up some K40-compatible pip binaries at https://nelsonliu.me/files/pytorch/whl/torch_stable.html (versions 1.3.1 to 1.6.0), and I'm hoping to keep them updated for new releases.

You can pip-install them with (change desired versions as necessary):

pip install torch==1.3.1+cu92 -f https://nelsonliu.me/files/pytorch/whl/torch_stable.html

I tested all of these binaries (except the CUDA 10.2 ones, hoping to get to those soon) by running the word-level language modeling example on a K40 and manually verifying that the perplexity was the same within versions and generally reasonable.

nelson-liu on 13 Oct 2020

@nelson-liu that's great. then i only need to worry about conda packages.

this is how i test for cuda: python -c 'import torch; torch.randn([3,5]).cuda()'
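A slightly stronger variant, as a sketch: `torch.randn([3,5]).cuda()` generates the numbers on the CPU and then copies them, and a plain host-to-device copy may not need any compute kernel. Generating the tensor directly on the GPU and reducing it does need compiled kernels, which is where the original "no kernel image is available" error appeared. `cuda_smoke_test` is a hypothetical name; it only uses standard calls (`torch.cuda.is_available`, `torch.randn(..., device=...)`, `Tensor.item`, `torch.cuda.get_device_name`).

```python
def cuda_smoke_test():
    """Return (ok, message) after running a tiny computation on the first CUDA device."""
    try:
        import torch
    except ImportError:
        return False, "PyTorch is not installed"
    if not torch.cuda.is_available():
        return False, "CUDA is not available"
    try:
        # Random generation and the reduction both launch GPU kernels, so a build
        # without code for this GPU fails here instead of later inside a model.
        value = (torch.randn(3, 5, device="cuda") * 2).sum().item()
    except RuntimeError as err:
        return False, f"kernel launch failed: {err}"
    return True, f"ok on {torch.cuda.get_device_name(0)} (sum={value:.3f})"


if __name__ == "__main__":
    ok, message = cuda_smoke_test()
    print(message)
```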

jayenashar on 14 Oct 2020

@jayenashar
That's so great to see your releases!
I'm planning to subscribe to your future releases as well, if you still plan to keep up with later PyTorch releases. Do you plan to always release to that forked repo? Where should I request a specific build, here or elsewhere?

Guptajakala on 30 Oct 2020

@Guptajakala yes i will release to that forked repo, unless someone knows a better place. i tried pypi but it seems they have a file size limit and that is the reason the official builds don't support old GPUs. i can try an anaconda channel.

right now i'm taking requests here as it seems to be the discoverable place.

jayenashar on 30 Oct 2020

@jayenashar
Hi, does this work with python 3.6.9?

I downloaded this one and ran the command:
conda install ./pytorch-1.6.0-py3.7_cuda10.1.243_cudnn7.6.3_0.tar.bz2