Hi,
I was trying to use pytorch-lightning with a TPU on a Google Cloud virtual machine, created with this command:
gcloud compute instances create tpu-vm \
--machine-type=n1-standard-4 \
--image-project=ml-images \
--image-family=torch-xla \
--boot-disk-size=200GB \
--scopes=cloud-platform
When I try to import pytorch_lightning, it hangs every time. I tried both a Jupyter notebook and Python in a terminal, with the same result. When I interrupt the import with KeyboardInterrupt, it prints the traceback below, which suggests the hang is in the multiprocessing part. I have tried different installation methods but still have this problem. Any ideas or help would be much appreciated! Thanks!
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-1-702a026384b9> in <module>
1 import torch_xla.core.xla_model as xm
----> 2 import pytorch_lightning as pl
3 # from pytorch_lightning import Trainer, seed_everything
/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/__init__.py in <module>
54 # We are not importing the rest of the lightning during the build process, as it may not be compiled yet
55 else:
---> 56 from pytorch_lightning.core import LightningDataModule, LightningModule
57 from pytorch_lightning.callbacks import Callback
58 from pytorch_lightning.trainer import Trainer
/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/__init__.py in <module>
14
15 from pytorch_lightning.core.datamodule import LightningDataModule
---> 16 from pytorch_lightning.core.lightning import LightningModule
17
18 __all__ = [
/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py in <module>
44
45
---> 46 TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
47
48 if TPU_AVAILABLE:
/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py in tpu_device_exists()
88 """
89 if XLADeviceUtils.TPU_AVAILABLE is None and TORCHXLA_AVAILABLE:
---> 90 XLADeviceUtils.TPU_AVAILABLE = pl_multi_process(XLADeviceUtils._is_device_tpu)()
91 return XLADeviceUtils.TPU_AVAILABLE
/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py in wrapper(*args, **kwargs)
41 proc = Process(target=inner_f, args=(queue, func,), kwargs=kwargs)
42 proc.start()
---> 43 proc.join()
44 return queue.get()
45
/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/process.py in join(self, timeout)
122 assert self._parent_pid == os.getpid(), 'can only join a child process'
123 assert self._popen is not None, 'can only join a started process'
--> 124 res = self._popen.wait(timeout)
125 if res is not None:
126 _children.discard(self)
/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py in wait(self, timeout)
48 return None
49 # This shouldn't block if wait() returned successfully.
---> 50 return self.poll(os.WNOHANG if timeout == 0.0 else 0)
51 return self.returncode
52
/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py in poll(self, flag)
26 while True:
27 try:
---> 28 pid, sts = os.waitpid(self.pid, flag)
29 except OSError as e:
30 # Child process not yet created. See #1731717
KeyboardInterrupt:
@lezwon this is the utilities function you recently added, right? Maybe you need to put a timeout on the join()?
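For reference, a minimal sketch of what such a timeout guard might look like, mirroring the wrapper structure visible in the traceback above. This is not the actual pytorch-lightning code; the 20-second budget and the fallback to False are assumptions for illustration:
from multiprocessing import Process, Queue
from queue import Empty

def pl_multi_process(func):
    def wrapper(*args, **kwargs):
        result_queue = Queue()

        def inner_f(q, f, **kw):
            # Runs in the child process; reports the probe result to the parent.
            q.put(f(**kw))

        proc = Process(target=inner_f, args=(result_queue, func), kwargs=kwargs)
        proc.start()
        proc.join(20)  # assumption: give the TPU probe at most 20 seconds
        if proc.is_alive():
            proc.terminate()  # the probe hung; don't block the import
        try:
            return result_queue.get_nowait()
        except Empty:
            return False  # no answer from the child, assume no TPU
    return wrapper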
Sure. I'll add a timeout. However, I'm curious to know why the process hangs. @L4zyy mind executing this in the console and letting us know the output?
from pytorch_lightning.utilities.xla_device_utils import XLADeviceUtils
XLADeviceUtils._is_device_tpu()
It hangs every time I import pytorch_lightning. When I ran the first line of this code, it hung, and when I interrupted it with Ctrl-C, I got:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/__init__.py", line 56, in <module>
from pytorch_lightning.core import LightningDataModule, LightningModule
File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
from pytorch_lightning.core.lightning import LightningModule
File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 46, in <module>
TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py", line 90, in tpu_device_exists
XLADeviceUtils.TPU_AVAILABLE = pl_multi_process(XLADeviceUtils._is_device_tpu)()
File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py", line 43, in wrapper
proc.join()
File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
Sorry about that. Try this one. It does not import Lightning.
import torch_xla.core.xla_model as xm
device = xm.xla_device()
xm.xla_device_hw(device)
I found the reason for the hang. I forgot to export TPU_IP_ADDRESS before importing. When the variable is not exported, the device = ... line hangs and cannot be interrupted. After exporting the variable, I can get the device normally and the third line returns 'TPU'. I can also successfully import pytorch_lightning after exporting the variable.
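For anyone hitting the same hang, a rough sketch of the setup, done from Python before any torch_xla import. The IP address is a made-up example (use your own TPU node's address), and the XRT_TPU_CONFIG line follows the format shown in the Cloud TPU / pytorch-xla docs:
import os

# Hypothetical example address: substitute the IP of your own Cloud TPU node.
os.environ["TPU_IP_ADDRESS"] = "10.0.0.2"
os.environ["XRT_TPU_CONFIG"] = "tpu_worker;0;{}:8470".format(
    os.environ["TPU_IP_ADDRESS"]
)

import torch_xla.core.xla_model as xm

device = xm.xla_device()          # hangs indefinitely if the config is missing
print(xm.xla_device_hw(device))   # prints 'TPU' once it is set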
that's awesome 👍 Glad it worked :)
It seems we don't have TPU_IP_ADDRESS anywhere in our codebase. Is it something we should document, or is it a system configuration for xla that the user has to do?
It's present in the cloud docs. Maybe we could add it to our docs too?
@awaelchli I was wondering if we should lazy-load TPU_AVAILABLE, maybe only when the user selects TPU training. That way we can ensure users can import lightning without running into errors like this.
Yes, I think that would be the best :)
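A hypothetical sketch of that lazy check, with illustrative names rather than the eventual implementation:
_TPU_AVAILABLE = None  # deliberately not computed at import time

def tpu_device_exists():
    # Hypothetical lazy variant of XLADeviceUtils.tpu_device_exists():
    # nothing TPU-related runs at `import pytorch_lightning`; the probe
    # only happens once TPU training is actually requested.
    global _TPU_AVAILABLE
    if _TPU_AVAILABLE is None:
        try:
            import torch_xla.core.xla_model as xm  # deferred import
            _TPU_AVAILABLE = xm.xla_device_hw(xm.xla_device()) == "TPU"
        except Exception:
            _TPU_AVAILABLE = False
    return _TPU_AVAILABLE
The probe would still want a timeout guard like the one sketched earlier, since xm.xla_device() itself is the call that hangs when the TPU address is not configured.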