Pytorch-lightning: Hanging when importing pytorch_lightning on google cloud vm.

Created on 23 Oct 2020  ·  10 Comments  ·  Source: PyTorchLightning/pytorch-lightning

❓ Questions and Help

Hi,

I was trying to use pytorch-lightning with TPU on google cloud virtual machine and the virtual machine is created by this command:

gcloud compute instances create tpu-vm \
       --machine-type=n1-standard-4 \
       --image-project=ml-images \
       --image-family=torch-xla \
       --boot-disk-size=200GB \
   --scopes=cloud-platform

When I try to import pytorch_lightning, it hangs every time. I tried in both a Jupyter notebook and the Python interpreter in a terminal, with the same result. When I interrupt the import with KeyboardInterrupt, it prints the traceback below. It seems to have something to do with the multiprocessing part. I have tried different installation methods but still hit this problem. Could you help me solve it? Any ideas or help would be much appreciated! Thanks!

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-1-702a026384b9> in <module>
      1 import torch_xla.core.xla_model as xm
----> 2 import pytorch_lightning as pl
      3 # from pytorch_lightning import Trainer, seed_everything

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/__init__.py in <module>
     54     # We are not importing the rest of the lightning during the build process, as it may not be compiled yet
     55 else:
---> 56     from pytorch_lightning.core import LightningDataModule, LightningModule
     57     from pytorch_lightning.callbacks import Callback
     58     from pytorch_lightning.trainer import Trainer

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/__init__.py in <module>
     14 
     15 from pytorch_lightning.core.datamodule import LightningDataModule
---> 16 from pytorch_lightning.core.lightning import LightningModule
     17 
     18 __all__ = [

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py in <module>
     44 
     45 
---> 46 TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
     47 
     48 if TPU_AVAILABLE:

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py in tpu_device_exists()
     88         """
     89         if XLADeviceUtils.TPU_AVAILABLE is None and TORCHXLA_AVAILABLE:
---> 90             XLADeviceUtils.TPU_AVAILABLE = pl_multi_process(XLADeviceUtils._is_device_tpu)()
     91         return XLADeviceUtils.TPU_AVAILABLE

/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py in wrapper(*args, **kwargs)
     41         proc = Process(target=inner_f, args=(queue, func,), kwargs=kwargs)
     42         proc.start()
---> 43         proc.join()
     44         return queue.get()
     45 

/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/process.py in join(self, timeout)
    122         assert self._parent_pid == os.getpid(), 'can only join a child process'
    123         assert self._popen is not None, 'can only join a started process'
--> 124         res = self._popen.wait(timeout)
    125         if res is not None:
    126             _children.discard(self)

/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py in wait(self, timeout)
     48                     return None
     49             # This shouldn't block if wait() returned successfully.
---> 50             return self.poll(os.WNOHANG if timeout == 0.0 else 0)
     51         return self.returncode
     52 

/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py in poll(self, flag)
     26             while True:
     27                 try:
---> 28                     pid, sts = os.waitpid(self.pid, flag)
     29                 except OSError as e:
     30                     # Child process not yet created. See #1731717

KeyboardInterrupt: 
Labels: Priority P0, TPU, bug / fix

All 10 comments

@lezwon this is the utilities function you recently added, right? Maybe you need to put a timeout on the join()?
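A timeout on `join()` could look roughly like the sketch below. This is a hypothetical reconstruction of the `pl_multi_process` wrapper with a timeout added, not the actual patch that landed; the 10-second limit and the `False` fallback are assumptions:

```python
import traceback
from multiprocessing import Process, Queue


def pl_multi_process(func):
    """Run `func` in a child process so a hanging XLA probe cannot
    block the parent forever. Hypothetical sketch, not the real fix."""

    def wrapper(*args, **kwargs):
        queue = Queue()

        def inner(q):
            try:
                q.put(func(*args, **kwargs))
            except Exception:
                traceback.print_exc()
                q.put(False)

        proc = Process(target=inner, args=(queue,))
        proc.start()
        proc.join(timeout=10)  # assumed limit; do not wait forever
        if proc.is_alive():
            # The child is stuck (e.g. waiting on an unreachable TPU):
            # kill it and treat the device check as failed.
            proc.terminate()
            proc.join()
            return False
        return queue.get()

    return wrapper
```

With this change, an unreachable TPU makes the device check return `False` after the timeout instead of freezing the import.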

Sure. I'll add a timeout. However, I'm curious to know why the process hangs. @L4zyy mind executing this in the console and letting us know the output?

from pytorch_lightning.utilities.xla_device_utils import XLADeviceUtils
XLADeviceUtils._is_device_tpu()

It hangs every time I import pytorch_lightning. When I ran the first line of this snippet, it hung as well, and when I interrupted it with Ctrl-C, I got:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/__init__.py", line 56, in <module>
    from pytorch_lightning.core import LightningDataModule, LightningModule
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 46, in <module>
    TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py", line 90, in tpu_device_exists
    XLADeviceUtils.TPU_AVAILABLE = pl_multi_process(XLADeviceUtils._is_device_tpu)()
  File "/anaconda3/envs/ldetr/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device_utils.py", line 43, in wrapper
    proc.join()
  File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/anaconda3/envs/ldetr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)

Sorry about that. Try this one. It does not import Lightning.

import torch_xla.core.xla_model as xm
device = xm.xla_device()
xm.xla_device_hw(device)

I found the cause of the hang. I forgot to export TPU_IP_ADDRESS before importing. When the variable is not exported, the device = ... line hangs and cannot be interrupted. After exporting it, I get the device normally and the third line returns 'TPU'. I can also import pytorch_lightning successfully once the variable is exported.
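For reference, the fix amounts to pointing XRT at the TPU before torch_xla is imported. A minimal sketch in Python (the IP address here is a placeholder for your Cloud TPU node's internal IP, and the `XRT_TPU_CONFIG` format follows the Cloud TPU / PyTorch XLA docs; adapt both to your setup):

```python
import os

# Placeholder address: substitute the internal IP of your Cloud TPU node.
os.environ["TPU_IP_ADDRESS"] = "10.0.0.2"

# torch_xla reads XRT_TPU_CONFIG to locate the TPU worker; the value
# below follows the "tpu_worker;0;<ip>:8470" format from the Cloud TPU docs.
os.environ["XRT_TPU_CONFIG"] = "tpu_worker;0;{}:8470".format(
    os.environ["TPU_IP_ADDRESS"]
)

# Only now import torch_xla / pytorch_lightning, so the runtime can
# reach the device instead of hanging:
# import torch_xla.core.xla_model as xm
# import pytorch_lightning as pl
```

Equivalently, `export TPU_IP_ADDRESS=...` in the shell before starting Python has the same effect.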

that's awesome 👍 Glad it worked :)

It seems we don't have the TPU_IP_ADDRESS anywhere in our codebase. Is it something we should document, or this is a system configuration with xla that the user has to do?

It's present in cloud docs. Maybe we could add it to our docs too?

@awaelchli I was wondering if we should lazy-load TPU_AVAILABLE, maybe only when the user selects TPU training. That way we can ensure users are able to import lightning without running into errors like this.

Yes, I think that would be the best :)
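A lazy, cached check could look something like this sketch. The class and method names mirror the ones in the traceback above, but the body is illustrative; the real probe would call into torch_xla rather than return a constant:

```python
class XLADeviceUtils:
    _TPU_AVAILABLE = None  # None means "not checked yet"

    @staticmethod
    def _is_device_tpu():
        # Illustrative placeholder for the real device probe, which
        # would import torch_xla and query the XLA device type.
        return False

    @classmethod
    def tpu_device_exists(cls):
        # Probe the hardware only on first call, then cache the answer,
        # so merely importing the library never touches the TPU runtime.
        if cls._TPU_AVAILABLE is None:
            cls._TPU_AVAILABLE = cls._is_device_tpu()
        return cls._TPU_AVAILABLE
```

The key point is that `tpu_device_exists()` is no longer executed at module import time, only when something (e.g. the Trainer configured for TPUs) actually asks for it.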
