A comparison is made between a torch.FloatTensor and an XLA tensor in pytorch_lightning/callbacks/early_stopping.py, which crashes training on TPU:
Exception in device=TPU:2: torch_xla/csrc/aten_xla_bridge.cpp:69 : Check failed: xtensor
*** Begin stack trace ***
tensorflow::CurrentStackTrace[abi:cxx11]()
torch_xla::bridge::GetXlaTensor(at::Tensor const&)
torch_xla::AtenXlaType::lt(at::Tensor const&, at::Tensor const&)
c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&)
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
_PyObject_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
*** End stack trace ***
Input tensor is not an XLA tensor: torch.FloatTensor
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 523, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 913, in run_pretrain_routine
self.train()
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
self.run_training_epoch()
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 453, in run_training_epoch
self.call_early_stop_callback()
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 793, in call_early_stop_callback
self.early_stop_callback.on_epoch_end(self, self.get_model())
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/early_stopping.py", line 122, in on_epoch_end
if self.monitor_op(current - self.min_delta, self.best):
RuntimeError: torch_xla/csrc/aten_xla_bridge.cpp:69 : Check failed: xtensor
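For context, the failing comparison can be reproduced in isolation. A minimal sketch (assuming a TPU runtime with torch_xla installed and the torch_xla.core.xla_model import path; the values are just illustrative):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
current = torch.tensor(0.5).to(device)  # monitored metric ends up as an XLA tensor on the TPU
best = torch.tensor(float('inf'))       # the torch_inf default, a plain CPU FloatTensor
torch.lt(current, best)                 # raises "Input tensor is not an XLA tensor: torch.FloatTensor"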
I've observed this on GPU as well on PyTorch 1.2 and Lightning 0.7.5.
I also think it's a bug.
The real problem is that the torch_inf tensor defined at early_stopping.py:16 stays on the CPU instead of the GPU/TPU.
I fixed mine locally by modifying on_train_start like this:
def on_train_start(self, trainer, pl_module):
    # Allow instances to be re-used
    self.wait = 0
    self.stopped_epoch = 0
    self.best = torch_inf if self.monitor_op == torch.lt else -torch_inf
    if trainer.on_gpu:
        # this probably only works on single gpu
        self.best = self.best.cuda()
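If you need it to work beyond a single GPU, one option (just a sketch, assuming the model has already been moved to its target device by the time on_train_start runs) is to read the device off the model's parameters instead of hard-coding .cuda():

def on_train_start(self, trainer, pl_module):
    # Allow instances to be re-used
    self.wait = 0
    self.stopped_epoch = 0
    self.best = torch_inf if self.monitor_op == torch.lt else -torch_inf
    # follow whatever device the model lives on (CPU, a specific CUDA device, or XLA)
    device = next(pl_module.parameters()).device
    self.best = self.best.to(device)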
So my guess is that for TPU you can do something similar to the .cuda() fix above, adding something like:

if trainer.on_tpu:
    tpu_device = xm.xla_device()
    self.best = self.best.to(tpu_device)
Just make sure you have torch_xla_py.xla_model imported as xm, and whatever other requirements TPUs need; I don't know enough about TPUs.
There's XLA_AVAILABLE on the trainer, maybe you can use that.
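Yet another option that avoids trainer flags entirely, as a rough sketch: right before the comparison that blows up (early_stopping.py:122 in the traceback above), move self.best onto the metric's device, assuming current is a tensor at that point:

# sketch of on_epoch_end, where `current` is the monitored metric tensor
self.best = self.best.to(current.device)  # keep best on the same device as the metric
if self.monitor_op(current - self.min_delta, self.best):
    ...  # existing improvement/patience logic unchanged

That way the callback behaves the same on CPU, GPU and TPU without any device checks.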
Hey @edirgarcia, mind sending out a PR to fix?