I am trying to train my model using Colab TPUs, but I am getting the following error and am a bit baffled. Any help or guidance would be greatly appreciated.
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/decorators.py:13: UserWarning: data_loader decorator deprecated in 0.7.0. Will remove 0.9.0
warnings.warn(w)
2020-03-10 01:18:06.634594: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57985
2020-03-10 01:18:06.638992: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57985
2020-03-10 01:18:06.645716: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57985
2020-03-10 01:18:06.653893: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57985
2020-03-10 01:18:06.660726: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57985
2020-03-10 01:18:06.663941: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57985
2020-03-10 01:18:06.669584: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57985
2020-03-10 01:18:15.441052: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check failed: session.Run({tensorflow::Output(result, 0)}, &outputs) == ::tensorflow::Status::OK() (Internal: From /job:tpu_worker/replica:0/task:0:
Global core array does not contain host cores [0x0x0_TC0,0x0x0_TC1,1x0x0_TC0,1x0x0_TC1,0x1x0_TC0,0x1x0_TC1,1x1x0_TC0,1x1x0_TC1]
[[{{node configure_distributed_tpu/_0}}]] vs. OK)
*** Begin stack trace ***
tensorflow::CurrentStackTrace[abi:cxx11]()
xla::XrtComputationClient::InitializeAndFetchTopology(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tensorflow::ConfigProto const&)
xla::XrtComputationClient::InitializeDevices(std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
xla::ComputationClient::Create()
xla::ComputationClient::Get()
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
_PyObject_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyEval_EvalCode
PyRun_FileExFlags
PyRun_SimpleFileExFlags
Py_Main
main
__libc_start_main
_start
*** End stack trace ***
Traceback (most recent call last):
File "main.py", line 114, in <module>
main(hparams)
File "main.py", line 95, in main
trainer.fit(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 614, in fit
xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
start_method=start_method)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 116, in _start_fn
_setup_replication()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 108, in _setup_replication
device = xm.xla_device()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 137, in xla_device
devkind=[devkind] if devkind is not None else None)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 41, in get_xla_supported_devices
xla_devices = torch_xla._XLAC._xla_get_devices()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1198 : Check failed: session.Run({tensorflow::Output(result, 0)}, &outputs) == ::tensorflow::Status::OK() (Internal: From /job:tpu_worker/replica:0/task:0:
Global core array does not contain host cores [0x0x0_TC0,0x0x0_TC1,1x0x0_TC0,1x0x0_TC1,0x1x0_TC0,0x1x0_TC1,1x1x0_TC0,1x1x0_TC1]
[[{{node configure_distributed_tpu/_0}}]] vs. OK)
*** Begin stack trace ***
tensorflow::CurrentStackTrace[abi:cxx11]()
xla::XrtComputationClient::InitializeAndFetchTopology(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tensorflow::ConfigProto const&)
xla::XrtComputationClient::InitializeDevices(std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
xla::ComputationClient::Create()
xla::ComputationClient::Get()
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
_PyObject_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyEval_EvalCode
PyRun_FileExFlags
PyRun_SimpleFileExFlags
Py_Main
main
__libc_start_main
_start
*** End stack trace ***
Hi! thanks for your contribution!, great first issue!
It seems I was missing an import for my dataloader. Not sure why it displayed the way it did though.
Hi @GuardedAirplane,
I am running into similar problems. Could you specify which import was missing?
Thanks
I usually just restart the environment. Seems to be a bug with colab somehow.
@dlibenzi
Yes, there was a bug introduced by a CL we reverted yesterday.
Today's nightly should be capturing it.
Most helpful comment
Yes, there was a bug introduced by a CL we reverted yesterday.
Today's nightly should be capturing it.
https://github.com/pytorch/xla/issues/1731