Hi, I encountered the problem like #899 锛孊ut I checked my pytorch is not CPU version. Can anyone help? Thanks!
Traceback (most recent call last):
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/allen_wu/.vscode-server-insiders/extensions/ms-python.python-2020.3.69010/pythonFiles/lib/python/debugpy/wheels/debugpy/__main__.py", line 45, in <module>
cli.main()
File "/home/allen_wu/.vscode-server-insiders/extensions/ms-python.python-2020.3.69010/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 427, in main
run()
File "/home/allen_wu/.vscode-server-insiders/extensions/ms-python.python-2020.3.69010/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 264, in run_file
runpy.run_path(options.target, run_name="__main__")
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/allen_wu/sota_lm_dev/codebase/gpt2/Gpt2SeqClassifier.py", line 200, in <module>
trainer = Trainer(gpus=1)
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 366, in __init__
self.data_parallel_device_ids = parse_gpu_ids(self.gpus)
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 622, in parse_gpu_ids
gpus = sanitize_gpu_ids(gpus)
File "/home/allen_wu/miniconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 592, in sanitize_gpu_ids
raise MisconfigurationException(message)
pytorch_lightning.utilities.debugging.MisconfigurationException:
You requested GPUs: [0]
But your machine only has: []
My environment packages:
# Name Version Build Channel
_libgcc_mutex 0.1 main
absl-py 0.9.0 pypi_0 pypi
attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
blas 1.0 mkl
bleach 3.1.3 pyh8c360ce_0 conda-forge
boto3 1.12.24 pypi_0 pypi
botocore 1.15.24 pypi_0 pypi
ca-certificates 2019.11.28 hecc5488_0 conda-forge
cachetools 4.0.0 pypi_0 pypi
certifi 2019.11.28 py37hc8dfbb8_1 conda-forge
chardet 3.0.4 pypi_0 pypi
click 7.1.1 pypi_0 pypi
cudatoolkit 10.1.243 h6bb024c_0
decorator 4.4.2 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
docutils 0.15.2 pypi_0 pypi
entrypoints 0.3 py37hc8dfbb8_1001 conda-forge
filelock 3.0.12 pypi_0 pypi
freetype 2.9.1 h8a8886c_1
future 0.18.2 pypi_0 pypi
google-auth 1.11.3 pypi_0 pypi
google-auth-oauthlib 0.4.1 pypi_0 pypi
grpcio 1.27.2 pypi_0 pypi
icu 64.2 he1b5a44_1 conda-forge
idna 2.9 pypi_0 pypi
importlib-metadata 1.5.0 py37hc8dfbb8_1 conda-forge
importlib_metadata 1.5.0 1 conda-forge
intel-openmp 2020.0 166
ipykernel 5.1.4 py37h5ca1d4c_0 conda-forge
ipython 7.13.0 py37h43977f1_1 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
ipywidgets 7.5.1 pypi_0 pypi
jedi 0.16.0 py37hc8dfbb8_1 conda-forge
jinja2 2.11.1 py_0 conda-forge
jmespath 0.9.5 pypi_0 pypi
joblib 0.14.1 pypi_0 pypi
jpeg 9b h024ee3a_2
json5 0.9.0 py_0 conda-forge
jsonschema 3.2.0 py37hc8dfbb8_1 conda-forge
jupyter_client 6.0.0 py_0 conda-forge
jupyter_core 4.6.3 py37hc8dfbb8_1 conda-forge
jupyterlab 2.0.1 py_0 conda-forge
jupyterlab_server 1.0.7 py_0 conda-forge
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libsodium 1.0.17 h516909a_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_0
libuv 1.34.0 h516909a_0 conda-forge
markdown 3.2.1 pypi_0 pypi
markupsafe 1.1.1 py37h8f50634_1 conda-forge
mistune 0.8.4 py37h516909a_1000 conda-forge
mkl 2020.0 166
mkl-service 2.3.0 py37he904b0f_0
mkl_fft 1.0.15 py37ha843d7b_0
mkl_random 1.1.0 py37hd6b4f25_0
nbconvert 5.6.1 py37_0 conda-forge
nbformat 5.0.4 py_0 conda-forge
ncurses 6.2 he6710b0_0
ninja 1.9.0 py37hfd86e86_0
nodejs 13.10.1 hf5d1a2b_0 conda-forge
notebook 6.0.3 py37_0 conda-forge
numpy 1.18.1 py37h4f9e942_0
numpy-base 1.18.1 py37hde5b4d6_1
oauthlib 3.1.0 pypi_0 pypi
olefile 0.46 py37_0
openssl 1.1.1e h516909a_0 conda-forge
pandas 1.0.2 py37h0573a6f_0
pandoc 2.9.2 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parso 0.6.2 py_0 conda-forge
pexpect 4.8.0 py37hc8dfbb8_1 conda-forge
pickleshare 0.7.5 py37hc8dfbb8_1001 conda-forge
pillow 7.0.0 py37hb39fc2d_0
pip 20.0.2 py37_1
prometheus_client 0.7.1 py_0 conda-forge
prompt-toolkit 3.0.4 py_0 conda-forge
protobuf 3.11.3 pypi_0 pypi
ptyprocess 0.6.0 py_1001 conda-forge
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pygments 2.6.1 py_0 conda-forge
pyrsistent 0.15.7 py37h8f50634_1 conda-forge
python 3.7.6 h0371630_2
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.7 1_cp37m conda-forge
pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
pytorch-lightning 0.7.1 pypi_0 pypi
pytz 2019.3 py_0
pyzmq 19.0.0 py37hac76be4_1 conda-forge
readline 7.0 h7b6447c_5
regex 2020.2.20 pypi_0 pypi
requests 2.23.0 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rsa 4.0 pypi_0 pypi
s3transfer 0.3.3 pypi_0 pypi
sacremoses 0.0.38 pypi_0 pypi
scikit-learn 0.22.2.post1 pypi_0 pypi
scipy 1.4.1 pypi_0 pypi
send2trash 1.5.0 py_0 conda-forge
sentencepiece 0.1.85 pypi_0 pypi
setuptools 46.0.0 py37_0
six 1.14.0 py37_0
sklearn 0.0 pypi_0 pypi
sqlite 3.31.1 h7b6447c_0
tensorboard 2.1.1 pypi_0 pypi
terminado 0.8.3 py37hc8dfbb8_1 conda-forge
testpath 0.4.4 py_0 conda-forge
tk 8.6.8 hbc83047_0
tokenizers 0.5.2 pypi_0 pypi
torchtext 0.5.0 pypi_0 pypi
torchvision 0.5.0 py37_cu101 pytorch
tornado 6.0.4 py37h8f50634_1 conda-forge
tqdm 4.43.0 pypi_0 pypi
traitlets 4.3.3 py37hc8dfbb8_1 conda-forge
transformers 2.5.1 pypi_0 pypi
urllib3 1.25.8 pypi_0 pypi
wcwidth 0.1.8 py_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
werkzeug 1.0.0 pypi_0 pypi
wheel 0.34.2 py37_0
widgetsnbextension 3.5.1 pypi_0 pypi
xz 5.2.4 h14c3975_4
zeromq 4.3.2 he1b5a44_2 conda-forge
zipp 3.1.0 py_0 conda-forge
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0
If you run this in your python console,
import torch
print(torch.cuda.device_count())
what number does it return?
@awaelchli Thanks for your fast reply! It returns 1.
And I think it's because of the version of pytorch. I tried to change my pytorch version from py3.7_cuda10.1.243_cudnn7.6.3_0 to py3.6_cuda10.0.130_cudnn7.6.3_0. And the problem was solved.
Strange. So the code that I posted prints 0 with the first version, py3.7_cuda10.1.243_cudnn7.6.3_0? Could it be possible that your GPU supports cuda 10 but not 10.1? Just a wild guess.
No, the code that you posted still prints 1 with the first version. (py3.7_cuda10.1.243_cudnn7.6.3_0)
It's work fine when using only pytorch to train. But when I tried to rewrite it into the LightningModule template then I got this error. That also makes me confused.
That's my mistake. There's nothing to do with the version of Pytorch.
I set CUDA_VISIBLE_DEVICES to 1 in my program, but I only have one GPU on my machine. :(
So When I run the code that you posted in my python console it returns 1. But when I executed my program, there's no GPU that can be detected.
Thank you for replying!
That's a relief. Glad it was just this and nothing more complicated.
cheers!
@awaelchli Checked both this thread and this issue as well: https://github.com/PyTorchLightning/pytorch-lightning/issues/899
For me (Python: 3.7.5)
import torch
print(torch.cuda.device_count())
returns 0 on a DGX-1 (pascal generation) with cuda-10.1 and cuda-10.2 versions of pytorch 1.5.0. I did not find a cuda-10.0 version so I am currently working with pytorch 1.4.0 and cuda-10.0. "print(torch.cuda.device_count())" shows 8 with cuda-10.0 and pytorch 1.4.0
I am not setting CUDA_VISIBLE_DEVICES in my code.
Given what you describe here I can only conclude this is a PyTorch issue between versions 1.4 and 1.5. Have you checked their issue page?
@awaelchli Sorry, I should have tested this and mentioned this before. Works for cuda-9.2 version of pytorch 1.5.0, which does make me think its a cuda issue.
Training with a GPU works for me in pytorch, and pytorch lightning. But when I use ray.tune to search the hyperparameter space it dies with this exact error:
2020-09-01 13:39:29,157 ERROR trial_runner.py:523 -- Trial DEFAULT_2d46f_00009: Error processing event.
Traceback (most recent call last):
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=57762, ip=130.20.133.226)
File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/trainable.py", line 332, in train
result = self.step()
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 337, in step
self._report_thread_runner_error(block=True)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 456, in _report_thread_runner_error
.format(err_tb_str)))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=57762, ip=130.20.133.226)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 224, in run
self._entrypoint()
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 287, in entrypoint
self._status_reporter.get_checkpoint())
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 507, in _trainable_func
output = train_func(config)
File "
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in __init__
self.data_parallel_device_ids = _parse_gpu_ids(self.gpus)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 451, in _parse_gpu_ids
gpus = sanitize_gpu_ids(gpus)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 410, in sanitize_gpu_ids
""")
pytorch_lightning.utilities.exceptions.MisconfigurationException:
You requested GPUs: [0]
And:
torch.cuda.is_available()
True
torch.__version__
'1.3.1'
torch.cuda.device_count()
1
pytorch_lightning.__version__
'0.6.1.dev'
CUDA_VISIBLE_DEVICES=6 on an 8-gpu machine.
I would say it as ray.tune, but it fails inside pytorch_lightning.
Any thoughts?
Training with a GPU works for me in pytorch, and pytorch lightning. But when I use ray.tune to search the hyperparameter space it dies with this exact error:
2020-09-01 13:39:29,157 ERROR trial_runner.py:523 -- Trial DEFAULT_2d46f_00009: Error processing event.
Traceback (most recent call last):
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=57762, ip=130.20.133.226)
File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/trainable.py", line 332, in train
result = self.step()
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 337, in step
self._report_thread_runner_error(block=True)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 456, in _report_thread_runner_error
.format(err_tb_str)))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=57762, ip=130.20.133.226)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 224, in run
self._entrypoint()
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 287, in entrypoint
self._status_reporter.get_checkpoint())
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/ray/tune/function_runner.py", line 507, in _trainable_func
output = train_func(config)
File "", line 28, in train_tune
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in init
self.data_parallel_device_ids = _parse_gpu_ids(self.gpus)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 451, in _parse_gpu_ids
gpus = sanitize_gpu_ids(gpus)
File "/home/d3p692/anaconda3/envs/transformers/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 410, in sanitize_gpu_ids
""")
pytorch_lightning.utilities.exceptions.MisconfigurationException:
You requested GPUs: [0]
But your machine only has: []
And:
torch.cuda.is_available()
Truetorch.version
'1.3.1'torch.cuda.device_count()
1pytorch_lightning.version
'0.6.1.dev'CUDA_VISIBLE_DEVICES=6 on an 8-gpu machine.
I would say it as ray.tune, but it fails inside pytorch_lightning.
Any thoughts?
Hi, I just encountered the exact same issue. I wander if you have found a solution to fix it. Thanks :-)
Hi, I just encountered the exact same issue. I wander if you have found a solution to fix it. Thanks :-)
I think you should set "CUDA_VISIBLE_DEVICES" in the environment variables;
Here is the introduction from Ray[tune] official website.