Pytorch-lightning: TPU Colab

Created on 4 Aug 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

The TPU Colab live demo linked in the documentation is not working.

Is there a new one?

Labels: documentation, question, won't fix

Most helpful comment

@nateraw will move the notebook to this repo and add an automatic "load in Colab" link

All 6 comments

@Ceceu To make it work, you need to replace the version setup in the second cell with:

!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev

https://pytorch-lightning.readthedocs.io/en/latest/tpu.html#colab-tpus

Great, it's working.
In case it helps someone in the future, I also had to change
trainer = Trainer(num_tpu_cores=8, ...) to trainer = Trainer(tpu_cores=8, ...)
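If you need a script that runs on both old and new PyTorch Lightning versions, one option is a small compatibility shim for the renamed argument. This is a hypothetical helper (not part of the library), shown here only to illustrate the `num_tpu_cores` → `tpu_cores` rename:

```python
def adapt_trainer_kwargs(kwargs):
    """Rename the deprecated ``num_tpu_cores`` kwarg to ``tpu_cores``.

    Hypothetical helper: newer pytorch-lightning releases dropped
    ``num_tpu_cores`` in favor of ``tpu_cores``.
    """
    kwargs = dict(kwargs)  # don't mutate the caller's dict
    if "num_tpu_cores" in kwargs:
        kwargs["tpu_cores"] = kwargs.pop("num_tpu_cores")
    return kwargs


# Usage (assumes pytorch_lightning is installed):
#   from pytorch_lightning import Trainer
#   trainer = Trainer(**adapt_trainer_kwargs({"num_tpu_cores": 8,
#                                             "max_epochs": 10}))
```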

@williamFalcon need to fix colab


Even after inserting the suggested changes, training does not finish correctly. See the error trace:

Exception in device=TPU:1: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:5: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:4: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:3: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:7: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-5-76fa9fb4398c> in <module>()
     13 # most basic trainer, uses good defaults
     14 trainer = Trainer(tpu_cores=8, progress_bar_refresh_rate=20, max_epochs=10)
---> 15 trainer.fit(model, mnist_train, mnist_val)

4 frames
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1042             self.accelerator_backend = TPUBackend(self)
   1043             self.accelerator_backend.setup()
-> 1044             self.accelerator_backend.train(model)
   1045             self.accelerator_backend.teardown(model)
   1046 

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py in train(self, model)
     85                 args=(model, self.trainer, self.mp_queue),
     86                 nprocs=self.trainer.tpu_cores,
---> 87                 start_method=self.start_method
     88             )
     89 

/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    393         join=join,
    394         daemon=daemon,
--> 395         start_method=start_method)
    396 
    397 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156 
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 1 terminated with exit code 17
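The crash happens because `self.distributed_backend` is `None` when the Trainer is configured via `tpu_cores` alone, so calling `.lower()` on it raises `AttributeError`. A None-safe version of the failing check might look like the sketch below (a hypothetical standalone function, not the actual library fix):

```python
def should_transfer_state(distributed_backend):
    """None-safe version of the check in
    transfer_distrib_spawn_state_on_fit_end.

    ``distributed_backend`` may be None when the backend is inferred
    from ``tpu_cores`` rather than set explicitly; treat None as an
    empty string before lowercasing.
    """
    backend = (distributed_backend or "").lower()
    return backend not in ("ddp_spawn", "ddp_cpu", "tpu")
```

This only illustrates the guard; the real fix belongs inside `pytorch_lightning/trainer/distrib_data_parallel.py`, where the TPU backend should also set `distributed_backend` appropriately.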

@nateraw will move the notebook to this repo and add an automatic "load in Colab" link

This issue has been automatically marked as stale because it hasn't had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
