The TPU Colab live demo linked in the documentation is not working.
Is there a new one?
@Ceceu To get it working, you need to replace VERSION in the second cell with:
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev
https://pytorch-lightning.readthedocs.io/en/latest/tpu.html#colab-tpus
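As a quick sanity check after running those two commands (assuming the nightly install completed without errors), importing torch_xla and asking for the XLA device should work before you start training:

import torch_xla.core.xla_model as xm
print(xm.xla_device())  # should print an XLA device such as xla:1 on a Colab TPU runtime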
Great, it's working.
In case it helps someone in the future, I also had to change
trainer = Trainer(num_tpu_cores=8, ...) to trainer = Trainer(tpu_cores=8, ...).
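For anyone copying this, a minimal sketch of the updated call (the model and data loaders are placeholders for your own objects) is:

from pytorch_lightning import Trainer

# tpu_cores replaces the older num_tpu_cores argument
trainer = Trainer(tpu_cores=8, progress_bar_refresh_rate=20, max_epochs=10)
trainer.fit(model, train_loader, val_loader)  # model and loaders defined elsewhere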
@williamFalcon need to fix colab
Even after inserting the suggested changes, the training does not finish correctly. See the error traceback:
Exception in device=TPU:1: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py", line 118, in tpu_train_in_process
trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 425, in transfer_distrib_spawn_state_on_fit_end
if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
(The same exception and traceback are repeated for devices TPU:3, TPU:4, TPU:5, and TPU:7.)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-5-76fa9fb4398c> in <module>()
13 # most basic trainer, uses good defaults
14 trainer = Trainer(tpu_cores=8, progress_bar_refresh_rate=20, max_epochs=10)
---> 15 trainer.fit(model, mnist_train, mnist_val)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
1042 self.accelerator_backend = TPUBackend(self)
1043 self.accelerator_backend.setup()
-> 1044 self.accelerator_backend.train(model)
1045 self.accelerator_backend.teardown(model)
1046
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerator_backends/tpu_backend.py in train(self, model)
85 args=(model, self.trainer, self.mp_queue),
86 nprocs=self.trainer.tpu_cores,
---> 87 start_method=self.start_method
88 )
89
/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
393 join=join,
394 daemon=daemon,
--> 395 start_method=start_method)
396
397
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
156
157 # Loop on join until it returns True or raises an exception.
--> 158 while not context.join():
159 pass
160
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
111 raise Exception(
112 "process %d terminated with exit code %d" %
--> 113 (error_index, exitcode)
114 )
115
Exception: process 1 terminated with exit code 17
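For context, every spawned TPU process dies at the same check: trainer.distributed_backend is still None when the Trainer is configured only with tpu_cores, so calling .lower() on it raises the AttributeError shown above. A None-tolerant version of that check (a sketch only, not the actual upstream fix) would look like:

distributed_backend = None  # what the attribute holds in the failing run

# failing form, as in the traceback:
#   distributed_backend.lower()  -> AttributeError: 'NoneType' object has no attribute 'lower'

# None-tolerant form:
backend = (distributed_backend or "").lower()
if backend not in ("ddp_spawn", "ddp_cpu", "tpu"):
    pass  # non-spawn handling would go here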
@nateraw will move the notebook to this repo and add an automatic "Open in Colab" link.
This issue has been automatically marked as stale because it hasn't had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!