Pytorch-lightning: Issue with EarlyStopping Callback on TPU runtime: Input tensor is not an XLA tensor: torch.FloatTensor

Created on 14 May 2020 · 2 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

A comparison is made between a CPU torch.FloatTensor and an XLA tensor in pytorch_lightning/callbacks/early_stopping.py, which raises a RuntimeError on the TPU runtime.
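
In essence, the callback ends up calling monitor_op on an XLA tensor (the monitored metric) and a CPU tensor (the stored best value). A minimal sketch of that failing pattern, assuming a Colab TPU runtime with torch_xla installed (the xla_model import path varies across torch_xla versions):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
current = torch.tensor(0.5, device=device)  # XLA tensor, e.g. the monitored val_loss
best = torch.tensor(float('inf'))           # plain CPU torch.FloatTensor

torch.lt(current, best)  # RuntimeError: Input tensor is not an XLA tensor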

Stack trace

Exception in device=TPU:2: torch_xla/csrc/aten_xla_bridge.cpp:69 : Check failed: xtensor 
*** Begin stack trace ***
    tensorflow::CurrentStackTrace[abi:cxx11]()
    torch_xla::bridge::GetXlaTensor(at::Tensor const&)
    torch_xla::AtenXlaType::lt(at::Tensor const&, at::Tensor const&)
    c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&)

    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyFunction_FastCallDict

    PyObject_Call
    _PyEval_EvalFrameDefault

    PyObject_Call
    _PyEval_EvalFrameDefault

    PyObject_Call
    _PyEval_EvalFrameDefault

    _PyEval_EvalFrameDefault

    _PyEval_EvalFrameDefault

    _PyEval_EvalFrameDefault
    _PyFunction_FastCallDict

    _PyObject_FastCallKeywords

    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault

    _PyFunction_FastCallDict

    PyObject_Call
    _PyEval_EvalFrameDefault

    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault
    _PyEval_EvalFrameDefault

    PyObject_Call
    _PyEval_EvalFrameDefault

    PyObject_Call
    _PyEval_EvalFrameDefault

    _PyEval_EvalFrameDefault

*** End stack trace ***
Input tensor is not an XLA tensor: torch.FloatTensor
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 523, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 913, in run_pretrain_routine
    self.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 453, in run_training_epoch
    self.call_early_stop_callback()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 793, in call_early_stop_callback
    self.early_stop_callback.on_epoch_end(self, self.get_model())
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/early_stopping.py", line 122, in on_epoch_end
    if self.monitor_op(current - self.min_delta, self.best):
RuntimeError: torch_xla/csrc/aten_xla_bridge.cpp:69 : Check failed: xtensor 

Environment

  • PyTorch Version (e.g., 1.0): 1.4.0a0+c272758
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Any other relevant information: Colab TPU runtime

Labels: TPU · bug / fix · help wanted

All 2 comments

I've observed this on GPU as well, on PyTorch 1.2 and Lightning 0.7.5, and I also think it's a bug. The root cause is that the torch_inf tensor defined at early_stopping.py:16 lives on the CPU rather than on the GPU/TPU.
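
For context, the module-level constant in question looks roughly like this (paraphrased; check the exact line in your installed version):

import numpy as np
import torch

torch_inf = torch.tensor(np.Inf)  # created on the CPU and never moved to the accelerator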

I fixed mine locally by modifying on_train_start as follows:

def on_train_start(self, trainer, pl_module):
    # Allow instances to be re-used
    self.wait = 0
    self.stopped_epoch = 0
    self.best = torch_inf if self.monitor_op == torch.lt else -torch_inf

    if trainer.on_gpu:
        # NOTE: this probably only works on a single GPU
        self.best = self.best.cuda()

So my guess is that you can do something similar, adding something like:

    if trainer.on_tpu:
        tpu_device = xm.xla_device()
        self.best = self.best.to(tpu_device)

just make sure you have torch_xla_py.xla_model imported as xm, plus whatever else TPUs require; I don't know enough about TPUs.
There's an XLA_AVAILABLE flag on the trainer; maybe you can use that.
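
A device-agnostic alternative (a sketch with a hypothetical helper, not Lightning's actual API) is to move best onto the monitored metric's device right before the comparison, which covers CPU, CUDA, and XLA uniformly:

import numpy as np
import torch

torch_inf = torch.tensor(np.Inf)

def is_improvement(current, best, monitor_op=torch.lt, min_delta=0.0):
    # Hypothetical helper: moving `best` to the metric's device avoids
    # the mixed CPU/XLA (or CPU/CUDA) comparison on any backend.
    best = best.to(current.device)
    return monitor_op(current - min_delta, best)

The callback could then call is_improvement(current, self.best, self.monitor_op, self.min_delta) instead of comparing the two tensors directly.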

Hey @edirgarcia, mind sending out a PR to fix?
