Pytorch-lightning: Hanging when importing pytorch_lightning on google cloud vm.

Created on 23 Oct 2020  ·  10 Comments  ·  Source: PyTorchLightning/pytorch-lightning

❓ Questions and Help

Hi,

I was trying to use pytorch-lightning with TPU on google cloud virtual machine and the virtual machine is created by this command:

gcloud compute instances create tpu-vm \
       --machine-type=n1-standard-4 \
       --image-project=ml-images \
       --image-family=torch-xla \
       --boot-disk-size=200GB \
   --scopes=cloud-platform

When I try to import pytorch_lightning, it hangs every time. I tried in both a Jupyter notebook and the Python interpreter in a terminal, with the same result. When I interrupt the import with KeyboardInterrupt, it prints the traceback below. It seems to have something to do with the multiprocessing part. I have tried different installation methods but still hit this problem. Could you help me solve it? Any ideas or help would be much appreciated! Thanks!

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-1-702a026384b9> in <module>
      1 import torch_xla.core.xla_model as xm
----> 2 import pytorch_lightning as pl
      3 # from pytorch_lightning import Trainer, seed_everything

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/__init__.py in <module>
     54     # We are not importing the rest of the lightning during the build process, as it may not be compiled yet
     55 else:
---> 56     from pytorch_lightning.core import LightningDataModule, LightningModule
     57     from pytorch_lightning.callbacks import Callback
     58     from pytorch_lightning.trainer import Trainer

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/__init__.py in <module>
     14 
     15 from pytorch_lightning.core.datamodule import LightningDataModule
---> 16 from pytorch_lightning.core.lightning import LightningModule
     17 
     18 __all__ = [

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py in <module>
     44 
     45 
---> 46 TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
     47 
     48 if TPU_AVAILABLE:

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py in tpu_device_exists()
     88         """
     89         if XLADeviceUtils.TPU_AVAILABLE is None and TORCHXLA_AVAILABLE:
---> 90             XLADeviceUtils.TPU_AVAILABLE = pl_multi_process(XLADeviceUtils._is_device_tpu)()
     91         return XLADeviceUtils.TPU_AVAILABLE

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py in wrapper(*args, **kwargs)
     41         proc = Process(target=inner_f, args=(queue, func,), kwargs=kwargs)
     42         proc.start()
---> 43         proc.join()
     44         return queue.get()
     45 

/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/process.py in join(self, timeout)
    122         assert self._parent_pid == os.getpid(), 'can only join a child process'
    123         assert self._popen is not None, 'can only join a started process'
--> 124         res = self._popen.wait(timeout)
    125         if res is not None:
    126             _children.discard(self)

/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py in wait(self, timeout)
     48                     return None
     49             # This shouldn't block if wait() returned successfully.
---> 50             return self.poll(os.WNOHANG if timeout == 0.0 else 0)
     51         return self.returncode
     52 

/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py in poll(self, flag)
     26             while True:
     27                 try:
---> 28                     pid, sts = os.waitpid(self.pid, flag)
     29                 except OSError as e:
     30                     # Child process not yet created. See #1731717

KeyboardInterrupt: 
Labels: Priority P0, TPU, bug / fix

All 10 comments

@lezwon this is the utilities function you recently added, right? Maybe you need to put a timeout on the join()?
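A timeout on `join()` could look roughly like the sketch below. This is a hypothetical reconstruction of the `pl_multi_process` wrapper with a timeout added, not the actual patch that landed; the 10-second limit and the `False` fallback are assumptions:

```python
import traceback
from multiprocessing import Process, Queue


def pl_multi_process(func):
    """Run `func` in a child process so a hanging XLA probe cannot
    block the parent forever. Hypothetical sketch, not the real fix."""

    def wrapper(*args, **kwargs):
        queue = Queue()

        def inner(q):
            try:
                q.put(func(*args, **kwargs))
            except Exception:
                traceback.print_exc()
                q.put(False)

        proc = Process(target=inner, args=(queue,))
        proc.start()
        proc.join(timeout=10)  # assumed limit; do not wait forever
        if proc.is_alive():
            # The child is stuck (e.g. waiting on an unreachable TPU):
            # kill it and treat the device check as failed.
            proc.terminate()
            proc.join()
            return False
        return queue.get()

    return wrapper
```

With this change, an unreachable TPU makes the device check return `False` after the timeout instead of freezing the import.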

Sure. I'll add a timeout. However, I'm curious to know why the process hangs. @L4zyy mind executing this in the console and letting us know the output?

from pytorch_lightning.utilities.xla_device_utils import XLADeviceUtils
XLADeviceUtils._is_device_tpu()

It hangs every time I import pytorch_lightning. When I ran the first line of this snippet, it hung as well, and when I interrupted it with Ctrl-C, I got:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/__init__.py", line 56, in <module>
    from pytorch_lightning.core import LightningDataModule, LightningModule
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 46, in <module>
    TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py", line 90, in tpu_device_exists
    XLADeviceUtils.TPU_AVAILABLE = pl_multi_process(XLADeviceUtils._is_device_tpu)()
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py", line 43, in wrapper
    proc.join()
  File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)

Sorry about that. Try this one. It does not import Lightning.

import torch_xla.core.xla_model as xm
device = xm.xla_device()
xm.xla_device_hw(device)

I found the cause of the hang. I forgot to export TPU_IP_ADDRESS before importing. When the variable is not exported, the device = ... line hangs and cannot be interrupted. After exporting it, I get the device normally and the third line returns 'TPU'. I can also import pytorch_lightning successfully once the variable is exported.
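For reference, the fix amounts to pointing XRT at the TPU before torch_xla is imported. A minimal sketch in Python (the IP address here is a placeholder for your Cloud TPU node's internal IP, and the `XRT_TPU_CONFIG` format follows the Cloud TPU / PyTorch XLA docs; adapt both to your setup):

```python
import os

# Placeholder address: substitute the internal IP of your Cloud TPU node.
os.environ["TPU_IP_ADDRESS"] = "10.0.0.2"

# torch_xla reads XRT_TPU_CONFIG to locate the TPU worker; the value
# below follows the "tpu_worker;0;<ip>:8470" format from the Cloud TPU docs.
os.environ["XRT_TPU_CONFIG"] = "tpu_worker;0;{}:8470".format(
    os.environ["TPU_IP_ADDRESS"]
)

# Only now import torch_xla / pytorch_lightning, so the runtime can
# reach the device instead of hanging:
# import torch_xla.core.xla_model as xm
# import pytorch_lightning as pl
```

Equivalently, `export TPU_IP_ADDRESS=...` in the shell before starting Python has the same effect.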

that's awesome 👍 Glad it worked :)

It seems we don't have the TPU_IP_ADDRESS anywhere in our codebase. Is it something we should document, or this is a system configuration with xla that the user has to do?

It's present in cloud docs. Maybe we could add it to our docs too?

@awaelchli I was wondering if we should lazy-load TPU_AVAILABLE, maybe only when the user selects TPU training. That way we can ensure users are able to import lightning without running into errors like this.

Yes, I think that would be the best :)
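A lazy, cached check could look something like this sketch. The class and method names mirror the ones in the traceback above, but the body is illustrative; the real probe would call into torch_xla rather than return a constant:

```python
class XLADeviceUtils:
    _TPU_AVAILABLE = None  # None means "not checked yet"

    @staticmethod
    def _is_device_tpu():
        # Illustrative placeholder for the real device probe, which
        # would import torch_xla and query the XLA device type.
        return False

    @classmethod
    def tpu_device_exists(cls):
        # Probe the hardware only on first call, then cache the answer,
        # so merely importing the library never touches the TPU runtime.
        if cls._TPU_AVAILABLE is None:
            cls._TPU_AVAILABLE = cls._is_device_tpu()
        return cls._TPU_AVAILABLE
```

The key point is that `tpu_device_exists()` is no longer executed at module import time, only when something (e.g. the Trainer configured for TPUs) actually asks for it.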
