When I try to train a model on Kaggle TPUs with `num_tpu_cores` set to 8, I get `Exception: process 2 terminated with exit code 1`. It would be great if this worked on Kaggle.
Steps to reproduce the behavior:
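Roughly this (a minimal sketch; the LightningModule here is a stand-in MNIST model in the spirit of the notebook, not the exact notebook code):

```python
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl


class MNISTModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.l1(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(
            datasets.MNIST('.', train=True, download=True,
                           transform=transforms.ToTensor()),
            batch_size=64)


mnist_model = MNISTModel()
# most basic trainer, asking for all 8 TPU cores
trainer = pl.Trainer(num_tpu_cores=8)
trainer.fit(mnist_model)  # fails on Kaggle with "process N terminated with exit code 1"
```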
```
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-9-9251330963d1> in <module>
3 # most basic trainer, uses good defaults (1 TPU)
4 trainer = pl.Trainer(num_tpu_cores=8)
----> 5 trainer.fit(mnist_model)
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
714
715 # train
--> 716 xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
717
718 # load weights if not interrupted
/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
180 join=join,
181 daemon=daemon,
--> 182 start_method=start_method)
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
156
157 # Loop on join until it returns True or raises an exception.
--> 158 while not context.join():
159 pass
160
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
111 raise Exception(
112 "process %d terminated with exit code %d" %
--> 113 (error_index, exitcode)
114 )
115
Exception: process 3 terminated with exit code 1
```
```python
trainer = pl.Trainer(num_tpu_cores=8, precision=16)
```
Run the model using all 8 TPU cores.
```
cuda:
    GPU:
    available: False
    version: None
packages:
    numpy: 1.18.2
    pyTorch_debug: False
    pyTorch_version: 1.6.0a0+30e7055
    pytorch-lightning: 0.7.3
    tensorboard: 2.1.1
    tqdm: 4.42.0
system:
    OS: Linux
    architecture: 64bit
    processor:
    python: 3.6.6
    version: #1 SMP Sat Apr 4 00:12:45 PDT 2020
```
I think this is a Kaggle problem?
@dlibenzi any ideas?
It probably needs this at the top:
```
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev
```
those lines are already at the top:
https://www.kaggle.com/pytorchlightning/pytorch-on-tpu-with-pytorch-lightning

ah... yes. good catch.
Know of something more general that we can check? I assume the only two options are Kaggle and Colab?
@lezwon want to find an environment variable we can check to know if we're on Kaggle, and submit a PR?
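Something along these lines might work (just a sketch; the exact Kaggle variable names are from memory and should be double-checked inside a kernel):

```python
import os
import sys


def on_kaggle():
    # Kaggle kernels export a few KAGGLE_* environment variables; checking for
    # KAGGLE_URL_BASE or KAGGLE_KERNEL_RUN_TYPE is a reasonable proxy.
    return 'KAGGLE_URL_BASE' in os.environ or 'KAGGLE_KERNEL_RUN_TYPE' in os.environ


def on_colab():
    # Colab is easiest to detect via the google.colab package being importable.
    return 'google.colab' in sys.modules


start_method = 'fork' if (on_kaggle() or on_colab()) else 'spawn'
```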
Honestly, pytorch does not like fork because of CUDA, but I would make that the default, with the ability to change it via some environment variable in case someone has issues.
on GCP it would still be fork?
when would it not be fork with TPUs?
Fork is an issue with pytorch/CUDA mostly.
But for safety, I would just add a Kaggle check as well in your code, and leave spawn as default.
Fork also helps on Colab and Kaggle because they are low-memory VMs: you can reduce memory consumption by creating the model (on the default pytorch/CPU device) at global scope, and then calling to(xla_device) from within the xmp.spawn() target functions.
This avoids creating a pytorch/CPU copy of the model in each of the processes (one per core).
You can see a few tricks to fit models on Colab here:
https://colab.research.google.com/drive/1IvCxIg-Q_DlI7UNJuajpl4UZXNiW5jMg
Like creating the model at global scope, and serializing the to(xla_device) calls to avoid all 8 processes rushing to allocate host memory at the same time.
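A rough sketch of that pattern (the model here is just a stand-in, and the sleep-based staggering is only one crude way to serialize the to(device) calls):

```python
import time

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Build the model once at global scope, on CPU. With start_method='fork' the
# child processes share this memory copy-on-write instead of each constructing
# its own pytorch/CPU copy of the model.
model = torch.nn.Linear(28 * 28, 10)  # stand-in for the real model


def _mp_fn(index):
    device = xm.xla_device()
    # Stagger the to(device) calls so all 8 processes do not rush to allocate
    # host memory at the same time (crude; a lock would also work).
    time.sleep(xm.get_ordinal() * 2)
    model.to(device)
    # ... build the dataloader and run the training loop here ...


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')
```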
I also have this issue. If I use a GPU the model trains normally, but when I try a TPU, this happens.
EDIT: Having analyzed it, the issue is about running out of RAM.
I believe this has to do with XLA using up RAM. I constantly use up all my RAM, which causes the SIGKILL error. See https://github.com/pytorch/xla/issues/1280, which references the Kaggle discussions.
```
training on 8 TPU cores
INIT TPU local core: 0, global rank: 0
INIT TPU local core: 4, global rank: 4
INIT TPU local core: 6, global rank: 6
INIT TPU local core: 3, global rank: 3
INIT TPU local core: 7, global rank: 7
INIT TPU local core: 5, global rank: 5
INIT TPU local core: 2, global rank: 2
INIT TPU local core: 1, global rank: 1
Validation sanity check:
0/? [00:00, ?it/s]
Exception in device=TPU:6: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
False)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
return self.next()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
xm.mark_step()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Exception in device=TPU:1: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
False)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
return self.next()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
xm.mark_step()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
Exception Traceback (most recent call last)
1 model = hatefull_memesCL()
2 if __name__ == '__main__':
----> 3 trainer.fit(model)
3 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
111 raise Exception(
112 "process %d terminated with exit code %d" %
--> 113 (error_index, exitcode)
114 )
115
Exception: process 6 terminated with exit code 17
```
Hmm, this is something different:
Invalid argument: Computation requires more parameters (732) than supported (limit 236).
We have seen that a few times, but I keep forgetting what the root cause was.
It's a misconfiguration of the TPU service, but I do not remember how it can get into that state.
@dlibenzi it is an interesting issue, I will let you know if I find the bug.