Pytorch-lightning: TPU Colab

Created on 4 Aug 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

The TPU Colab live demo linked in the documentation is not working.

Is there a new one?

Labels: documentation, question, won't fix

Most helpful comment

@nateraw will move the notebook to this repo and add an automatic "load in Colab" link

All 6 comments

@Ceceu To make it work, you need to replace the version setup in the second cell with:

!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev

https://pytorch-lightning.readthedocs.io/en/latest/tpu.html#colab-tpus

Great, it's working.
In case it helps someone in the future, I also had to change
trainer = Trainer(num_tpu_cores=8, ...) to trainer = Trainer(tpu_cores=8, ...)
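If you need a script that runs on both old and new PyTorch Lightning versions, one option is a small compatibility shim for the renamed argument. This is a hypothetical helper (not part of the library), shown here only to illustrate the `num_tpu_cores` → `tpu_cores` rename:

```python
def adapt_trainer_kwargs(kwargs):
    """Rename the deprecated ``num_tpu_cores`` kwarg to ``tpu_cores``.

    Hypothetical helper: newer pytorch-lightning releases dropped
    ``num_tpu_cores`` in favor of ``tpu_cores``.
    """
    kwargs = dict(kwargs)  # don't mutate the caller's dict
    if "num_tpu_cores" in kwargs:
        kwargs["tpu_cores"] = kwargs.pop("num_tpu_cores")
    return kwargs


# Usage (assumes pytorch_lightning is installed):
#   from pytorch_lightning import Trainer
#   trainer = Trainer(**adapt_trainer_kwargs({"num_tpu_cores": 8,
#                                             "max_epochs": 10}))
```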

@williamFalcon need to fix colab


Even after inserting the suggested changes, training does not finish correctly. See the error trace:

Exception in device=TPU:1: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:5: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:4: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:3: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
Exception in device=TPU:7: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-5-76fa9fb4398c> in <module>()
     13 # most basic trainer, uses good defaults
     14 trainer = Trainer(tpu_cores=8, progress_bar_refresh_rate=20, max_epochs=10)
---> 15 trainer.fit(model, mnist_train, mnist_val)

4 frames
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1042             self.accelerator_backend = TPUBackend(self)
   1043             self.accelerator_backend.setup()
-> 1044             self.accelerator_backend.train(model)
   1045             self.accelerator_backend.teardown(model)
   1046 

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py in train(self, model)
     85                 args=(model, self.trainer, self.mp_queue),
     86                 nprocs=self.trainer.tpu_cores,
---> 87                 start_method=self.start_method
     88             )
     89 

/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    393         join=join,
    394         daemon=daemon,
--> 395         start_method=start_method)
    396 
    397 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156 
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 1 terminated with exit code 17
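The crash happens because `self.distributed_backend` is `None` when the Trainer is configured via `tpu_cores` alone, so calling `.lower()` on it raises `AttributeError`. A None-safe version of the failing check might look like the sketch below (a hypothetical standalone function, not the actual library fix):

```python
def should_transfer_state(distributed_backend):
    """None-safe version of the check in
    transfer_distrib_spawn_state_on_fit_end.

    ``distributed_backend`` may be None when the backend is inferred
    from ``tpu_cores`` rather than set explicitly; treat None as an
    empty string before lowercasing.
    """
    backend = (distributed_backend or "").lower()
    return backend not in ("ddp_spawn", "ddp_cpu", "tpu")
```

This only illustrates the guard; the real fix belongs inside `pytorch_lightning/trainer/distrib_data_parallel.py`, where the TPU backend should also set `distributed_backend` appropriately.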

@nateraw will move the notebook to this repo and add an automatic "load in Colab" link

This issue has been automatically marked as stale because it hasn't had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
