Pytorch-lightning: TPU error

Created on 10 Oct 2020 · 13 comments · Source: PyTorchLightning/pytorch-lightning

Hi,
I am getting a TPU error on Colab and I am using the latest version of lightning.
Notebook

Trainer:

trainer = pl.Trainer(tpu_cores=8, precision=16, logger=logger, checkpoint_callback=checkpoint_callback,
                     progress_bar_refresh_rate=50, accumulate_grad_batches=2, fast_dev_run=False,
                     default_root_dir=root_path, auto_lr_find=True, gradient_clip_val=0.5,
                     profiler=True, max_epochs=1000,
                     callbacks=[lr_monitor, early_stop, PrintTableMetricsCallback()])

Stack trace:

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
Using native 16bit precision.
training on 8 TPU cores
INIT TPU local core: 0, global rank: 0 with XLA_USE_BF16=1
Exception in device=TPU:0: dictionary update sequence element #0 has length 1; 2 is required
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py", line 122, in tpu_train_in_process
    self.trainer.train_loop.setup_training(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 132, in setup_training
    self.trainer.logger.save()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py", line 35, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/loggers/tensorboard.py", line 220, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/saving.py", line 378, in save_hparams_to_yaml
    yaml.dump(hparams, fp)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 290, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 278, in dump_all
    dumper.represent(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 27, in represent
    node = self.represent_data(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 207, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 343, in represent_object
    'tag:yaml.org,2002:python/object:'+function_name, state)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-16-2e5877a52826> in <module>()
      4 trainer = pl.Trainer(tpu_cores=8, precision=16, logger=logger, checkpoint_callback=checkpoint_callback, progress_bar_refresh_rate=50, accumulate_grad_batches=2, fast_dev_run=False,                    default_root_dir=root_path, auto_lr_find=True, gradient_clip_val=0.5,                    profiler=True,  max_epochs=1000, callbacks=[lr_monitor, early_stop, PrintTableMetricsCallback()])
      5 
----> 6 trainer.fit(model_one)

4 frames
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    420         self.call_hook('on_fit_start')
    421 
--> 422         results = self.accelerator_backend.train()
    423         self.accelerator_backend.teardown()
    424 

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py in train(self)
     95                 args=(model, self.trainer, self.mp_queue),
     96                 nprocs=self.trainer.tpu_cores,
---> 97                 start_method=self.start_method
     98             )
     99 

/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    393         join=join,
    394         daemon=daemon,
--> 395         start_method=start_method)
    396 
    397 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    155 
    156     # Loop on join until it returns True or raises an exception.
--> 157     while not context.join():
    158         pass
    159 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    110                 raise Exception(
    111                     "process %d terminated with exit code %d" %
--> 112                     (error_index, exitcode)
    113                 )
    114 

Exception: process 0 terminated with exit code 17
Labels: TPU · bug / fix · help wanted

All 13 comments

Hi! Thanks for your contribution, great first issue!

As per the XLA troubleshooting guide:

Tensor shapes should be the same between iterations, or a low number of shape variations should be used.

In this application the image shapes are different for each batch. Could that be the issue?

Seems like there's an issue while trying to save the hyperparams. @rohitgr7 any idea about this?
Also I am unable to run the entire notebook as the data is present in Google Drive.

Yeah, maybe. If we could somehow get either the data or just the input data shape, we could try to reproduce this and check what the real issue is here.

@rohitgr7 @lezwon The images are of different shapes, ranging from a few pixels to 65k pixels per channel. The images are padded within each batch to make them an equal shape, so the batch shapes differ between batches.

@ishgirwan mind trying it with the model excluded from the saved hyperparameters? Just call self.save_hyperparameters('batch_size', 'learning_rate') instead of self.save_hyperparameters().
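For reference, a rough sketch of what that could look like in the LightningModule __init__ (the names batch_size and learning_rate are only placeholders for whatever hyperparameters your module actually takes):

import pytorch_lightning as pl
import torch

class LitModel(pl.LightningModule):
    def __init__(self, model, batch_size=8, learning_rate=1e-3):
        super().__init__()
        # Record only simple, YAML-serializable values. Calling
        # save_hyperparameters() with no arguments would also capture `model`,
        # which the TensorBoard logger then fails to dump into hparams.yaml.
        self.save_hyperparameters('batch_size', 'learning_rate')
        self.model = model

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)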

@lezwon It seems to be working, but then I got this error. It may be a RAM issue, as the model was training for some time with a batch size of 8 on Colab (15 GB). I am new to TPUs, so how should I manage the batch size with respect to the number of cores? Also, what was the issue initially? Thanks a lot for your help.

training on 8 TPU cores
INIT TPU local core: 0, global rank: 0 with XLA_USE_BF16=None
INIT TPU local core: 4, global rank: 4 with XLA_USE_BF16=None
INIT TPU local core: 3, global rank: 3 with XLA_USE_BF16=None
INIT TPU local core: 7, global rank: 7 with XLA_USE_BF16=None
INIT TPU local core: 2, global rank: 2 with XLA_USE_BF16=None
INIT TPU local core: 1, global rank: 1 with XLA_USE_BF16=None
INIT TPU local core: 5, global rank: 5 with XLA_USE_BF16=None
INIT TPU local core: 6, global rank: 6 with XLA_USE_BF16=None

  | Name  | Type   | Params
---------------------------------
0 | model | ResNet | 11 M  
Validation sanity check: 0%
0/2 [00:55<?, ?it/s]
Epoch 0: 18%
50/281 [13:45<1:03:33, 16.51s/it, loss=0.013, v_num=66, train_loss_step=0.0138, train_loss=0.0138]
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 875, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 141, in _loader_worker
    _, data = next(data_iter)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 438, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1071, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1037, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 888, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1816) exited unexpectedly
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 875, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 141, in _loader_worker
    _, data = next(data_iter)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 438, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1071, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1037, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 888, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1851, 1856) exited unexpectedly


---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-15-fbb0322b0ef6> in <module>()
----> 1 trainer.fit(model_one)

4 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    150                     error_pid=failed_process.pid,
    151                     exit_code=exitcode,
--> 152                     signal_name=name
    153                 )
    154             else:

ProcessExitedException: process 3 terminated with signal SIGKILL 

I think the model could not be serialized to YAML, as shown in the error. It is probably best to avoid storing complex objects like models in the hyperparameters. :) Also, to debug the above error I'll need access to the notebook with the data. It's very hard to guess what might be going wrong without a reproducible notebook.

Thanks :) Can we connect on the PyTorch Lightning Slack channel? I can DM you the details over there.

Sure 👍

@ishgirwan I had a look at the notebook. I strongly suspect Colab runs out of memory during training, due to the dynamic size of your inputs. XLA will compile a new graph for every example whose size differs from the others. Reference. Try padding your inputs to a consistent size. Let me know if that works.
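If it helps, here is a minimal sketch of padding every image up to one fixed size in a custom collate_fn (the 512x512 target is just an assumption and presumes no image exceeds it):

import torch
import torch.nn.functional as F

TARGET_H, TARGET_W = 512, 512  # assumed fixed size; adjust to your dataset

def pad_collate(batch):
    # batch is a list of (image, label) pairs with images shaped (C, H, W)
    images, labels = [], []
    for img, label in batch:
        pad_h = TARGET_H - img.shape[-2]
        pad_w = TARGET_W - img.shape[-1]
        # pad the right and bottom edges so every tensor becomes (C, TARGET_H, TARGET_W)
        images.append(F.pad(img, (0, pad_w, 0, pad_h)))
        labels.append(label)
    return torch.stack(images), torch.tensor(labels)

# usage: DataLoader(dataset, batch_size=8, collate_fn=pad_collate)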

Also, I don't think this optimizer_step function needs to be defined. Lightning handles this behind the scenes and calls xm.optimizer_step(optimizer) when training on TPUs. 👍

def optimizer_step(self, current_epoch, batch_idx, optimizer,
                   optimizer_idx, second_order_closure=None,
                   on_tpu=False, using_native_amp=False, using_lbfgs=False):
    optimizer.step()
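For completeness, if such an override were ever genuinely needed, on TPU it would have to route the step through xm.optimizer_step itself; a rough sketch keeping the signature quoted above:

import torch_xla.core.xla_model as xm

def optimizer_step(self, current_epoch, batch_idx, optimizer,
                   optimizer_idx, second_order_closure=None,
                   on_tpu=False, using_native_amp=False, using_lbfgs=False):
    if on_tpu:
        # xm.optimizer_step also performs the cross-replica gradient reduction
        xm.optimizer_step(optimizer)
    else:
        optimizer.step()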

Thanks a lot @lezwon for looking into this issue. As you mentioned, the issue is probably with the way XLA works, since the code has run properly on GPU. Padding should work, as the images will then be of equal size. I will close this issue now.

Also, I have hosted the dataset at Kaggle. Thanks again.
