Hi,
I am getting a TPU error on Colab while using the latest version of Lightning.
Notebook
Trainer:
trainer = pl.Trainer(tpu_cores=8, precision=16, logger=logger, checkpoint_callback=checkpoint_callback,
                     progress_bar_refresh_rate=50, accumulate_grad_batches=2, fast_dev_run=False,
                     default_root_dir=root_path, auto_lr_find=True, gradient_clip_val=0.5,
                     profiler=True, max_epochs=1000,
                     callbacks=[lr_monitor, early_stop, PrintTableMetricsCallback()])
Stack trace:
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
Using native 16bit precision.
training on 8 TPU cores
INIT TPU local core: 0, global rank: 0 with XLA_USE_BF16=1
Exception in device=TPU:0: dictionary update sequence element #0 has length 1; 2 is required
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py", line 122, in tpu_train_in_process
self.trainer.train_loop.setup_training(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 132, in setup_training
self.trainer.logger.save()
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py", line 35, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/loggers/tensorboard.py", line 220, in save
save_hparams_to_yaml(hparams_file, self.hparams)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/saving.py", line 378, in save_hparams_to_yaml
yaml.dump(hparams, fp)
File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 290, in dump
return dump_all([data], stream, Dumper=Dumper, **kwds)
File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 278, in dump_all
dumper.represent(data)
File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 27, in represent
node = self.represent_data(data)
File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 48, in represent_data
node = self.yaml_representers[data_types[0]](self, data)
File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 207, in represent_dict
return self.represent_mapping('tag:yaml.org,2002:map', data)
File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 343, in represent_object
'tag:yaml.org,2002:python/object:'+function_name, state)
File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-16-2e5877a52826> in <module>()
4 trainer = pl.Trainer(tpu_cores=8, precision=16, logger=logger, checkpoint_callback=checkpoint_callback, progress_bar_refresh_rate=50, accumulate_grad_batches=2, fast_dev_run=False, default_root_dir=root_path, auto_lr_find=True, gradient_clip_val=0.5, profiler=True, max_epochs=1000, callbacks=[lr_monitor, early_stop, PrintTableMetricsCallback()])
5
----> 6 trainer.fit(model_one)
4 frames
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
420 self.call_hook('on_fit_start')
421
--> 422 results = self.accelerator_backend.train()
423 self.accelerator_backend.teardown()
424
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py in train(self)
95 args=(model, self.trainer, self.mp_queue),
96 nprocs=self.trainer.tpu_cores,
---> 97 start_method=self.start_method
98 )
99
/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
393 join=join,
394 daemon=daemon,
--> 395 start_method=start_method)
396
397
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
155
156 # Loop on join until it returns True or raises an exception.
--> 157 while not context.join():
158 pass
159
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
110 raise Exception(
111 "process %d terminated with exit code %d" %
--> 112 (error_index, exitcode)
113 )
114
Exception: process 0 terminated with exit code 17
Hi! Thanks for your contribution! Great first issue!
As per the XLA troubleshooting guide:
Tensor shapes should be the same between iterations, or a low number of shape variations should be used.
In this application the image shapes are different for each batch. Could that be the issue?
Seems like there's an issue while trying to save the hyperparams. @rohitgr7 any idea about this?
Also, I am unable to run the entire notebook, as the data is stored in Google Drive.
Yeah, maybe. If we can somehow get either the data or just the input data shapes, we might be able to reproduce this and find the real issue here.
@rohitgr7 @lezwon Images are of different shapes, ranging from a few pixels to 65k pixels per channel. The images are padded within a batch to make them an equal shape, so the batch shapes differ.
@ishgirwan mind trying to exclude the model from the saved hyperparameters? Just call self.save_hyperparameters('batch_size', 'learning_rate') instead of self.save_hyperparameters().
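For reference, a minimal sketch of that change (the module, argument names, and defaults here are illustrative, not the notebook's actual code):

import pytorch_lightning as pl
import torch

class LitModel(pl.LightningModule):
    def __init__(self, model: torch.nn.Module, batch_size: int = 8, learning_rate: float = 1e-3):
        super().__init__()
        self.model = model
        # Only the listed scalar arguments end up in hparams.yaml;
        # the nn.Module passed to __init__ is excluded from serialization.
        self.save_hyperparameters('batch_size', 'learning_rate')

    def forward(self, x):
        return self.model(x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)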
@lezwon It seems to be working, but then I got this error. It may be a RAM issue, as the model had been training for some time with a batch size of 8 on Colab (15 GB). I am new to TPUs, so how should I manage the batch size with respect to the number of cores? Also, what was the original issue? Thanks a lot for your help.
training on 8 TPU cores
INIT TPU local core: 0, global rank: 0 with XLA_USE_BF16=None
INIT TPU local core: 4, global rank: 4 with XLA_USE_BF16=None
INIT TPU local core: 3, global rank: 3 with XLA_USE_BF16=None
INIT TPU local core: 7, global rank: 7 with XLA_USE_BF16=None
INIT TPU local core: 2, global rank: 2 with XLA_USE_BF16=None
INIT TPU local core: 1, global rank: 1 with XLA_USE_BF16=None
INIT TPU local core: 5, global rank: 5 with XLA_USE_BF16=None
INIT TPU local core: 6, global rank: 6 with XLA_USE_BF16=None
| Name | Type | Params
---------------------------------
0 | model | ResNet | 11 M
Validation sanity check: 0%
0/2 [00:55<?, ?it/s]
Epoch 0: 18%
50/281 [13:45<1:03:33, 16.51s/it, loss=0.013, v_num=66, train_loss_step=0.0138, train_loss=0.0138]
Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 875, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 141, in _loader_worker
_, data = next(data_iter)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 438, in __next__
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1071, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1037, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 888, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1816) exited unexpectedly
Exception in thread Thread-10:
Traceback (most recent call last):
[same traceback as Thread-8 above]
RuntimeError: DataLoader worker (pid(s) 1851, 1856) exited unexpectedly
---------------------------------------------------------------------------
ProcessExitedException Traceback (most recent call last)
<ipython-input-15-fbb0322b0ef6> in <module>()
----> 1 trainer.fit(model_one)
4 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
150 error_pid=failed_process.pid,
151 exit_code=exitcode,
--> 152 signal_name=name
153 )
154 else:
ProcessExitedException: process 3 terminated with signal SIGKILL
I think the model could not be serialized to YAML, as shown in the error. Probably avoid storing complex objects like models within hparams. :) Also, to debug the error above, I'll need access to the notebook with the data. It's very hard to guess what might be going wrong without a reproducible notebook.
Thanks :) Can we connect on the PyTorch Lightning Slack channel? I can DM you the details over there.
Sure 👍
@ishgirwan I had a look at the notebook. I strongly suspect Colab runs out of memory during training due to the dynamic size of your inputs. XLA will compile a new graph for every example that has a different size from the others. Reference. Try padding your inputs to maintain a consistent size. Let me know if that works.
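For example, a minimal sketch of one way to do that with a custom collate function (the names, target size, and tensor layout are assumptions, not the notebook's actual code):

import torch
from torch.nn import functional as F

FIXED_H, FIXED_W = 256, 256  # assumed target size, chosen to fit the largest image

def pad_to_fixed_size(image):
    # image: (C, H, W) tensor with H <= FIXED_H and W <= FIXED_W
    _, h, w = image.shape
    # F.pad pads the last two dims: (left, right, top, bottom)
    return F.pad(image, (0, FIXED_W - w, 0, FIXED_H - h))

def collate_fixed(batch):
    # batch: list of (image, target) pairs; targets assumed to be tensors of equal shape
    images, targets = zip(*batch)
    return torch.stack([pad_to_fixed_size(img) for img in images]), torch.stack(targets)

# usage: DataLoader(dataset, batch_size=8, collate_fn=collate_fixed, num_workers=2)

With every batch padded to the same (C, FIXED_H, FIXED_W) shape, XLA compiles the graph once instead of recompiling per batch.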
Also, I don't think this optimizer_step function needs to be defined. Lightning handles this behind the scenes and calls xm.optimizer_step(optimizer) when training on TPUs. 👍
def optimizer_step(self, current_epoch, batch_idx, optimizer,
                   optimizer_idx, second_order_closure=None,
                   on_tpu=False, using_native_amp=False, using_lbfgs=False):
    optimizer.step()
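For reference, a rough sketch of the call Lightning makes for you on TPU (simplified, not the actual Lightning source; the model and optimizer below are placeholders):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(4, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4, device=device)).sum()
loss.backward()
# xm.optimizer_step all-reduces gradients across TPU cores and then steps the
# optimizer, so a plain optimizer.step() override adds nothing on TPU.
xm.optimizer_step(optimizer)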
Thanks a lot @lezwon for looking into this issue. As you mentioned, the issue is probably with the way XLA works, since the code ran properly on GPU. Padding should work, as the images will then all be the same size. I will close this issue now.
Also, I have hosted the dataset on Kaggle. Thanks again.