When I try to train a model on Kaggle TPUs with `num_tpu_cores` set to 8, I get `Exception: process 2 terminated with exit code 1`. It would be great if this worked on Kaggle.
Steps to reproduce the behavior:
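Roughly this (a minimal sketch; the LightningModule here is a stand-in MNIST model in the spirit of the notebook, not the exact notebook code):

```python
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl


class MNISTModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.l1(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(
            datasets.MNIST('.', train=True, download=True,
                           transform=transforms.ToTensor()),
            batch_size=64)


mnist_model = MNISTModel()
# most basic trainer, asking for all 8 TPU cores
trainer = pl.Trainer(num_tpu_cores=8)
trainer.fit(mnist_model)  # fails on Kaggle with "process N terminated with exit code 1"
```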
```
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-9-9251330963d1> in <module>
3 # most basic trainer, uses good defaults (1 TPU)
4 trainer = pl.Trainer(num_tpu_cores=8)
----> 5 trainer.fit(mnist_model)
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
714
715 # train
--> 716 xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
717
718 # load weights if not interrupted
/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
180 join=join,
181 daemon=daemon,
--> 182 start_method=start_method)
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
156
157 # Loop on join until it returns True or raises an exception.
--> 158 while not context.join():
159 pass
160
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
111 raise Exception(
112 "process %d terminated with exit code %d" %
--> 113 (error_index, exitcode)
114 )
115
Exception: process 3 terminated with exit code 1
```
```python
trainer = pl.Trainer(num_tpu_cores=8, precision=16)
```
Run the model using all 8 TPU cores.
```
cuda:
    GPU:
    available: False
    version: None
packages:
    numpy: 1.18.2
    pyTorch_debug: False
    pyTorch_version: 1.6.0a0+30e7055
    pytorch-lightning: 0.7.3
    tensorboard: 2.1.1
    tqdm: 4.42.0
system:
    OS: Linux
    architecture: 64bit
    processor:
    python: 3.6.6
    version: #1 SMP Sat Apr 4 00:12:45 PDT 2020
```
I think this is a Kaggle problem?
@dlibenzi any ideas?
It probably needs this at the top:
```
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev
```
those lines are already at the top:
https://www.kaggle.com/pytorchlightning/pytorch-on-tpu-with-pytorch-lightning

ah... yes. good catch.
Know of something more general that we can check? I assume the only two options are Kaggle and Colab?
@lezwon want to find an environment variable we can check to know if we're on Kaggle, and submit a PR?
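Something along these lines might work (just a sketch; the exact Kaggle variable names are from memory and should be double-checked inside a kernel):

```python
import os
import sys


def on_kaggle():
    # Kaggle kernels export a few KAGGLE_* environment variables; checking for
    # KAGGLE_URL_BASE or KAGGLE_KERNEL_RUN_TYPE is a reasonable proxy.
    return 'KAGGLE_URL_BASE' in os.environ or 'KAGGLE_KERNEL_RUN_TYPE' in os.environ


def on_colab():
    # Colab is easiest to detect via the google.colab package being importable.
    return 'google.colab' in sys.modules


start_method = 'fork' if (on_kaggle() or on_colab()) else 'spawn'
```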
Honestly, pytorch does not like fork because of CUDA, but I would make that the default, with the ability to change it via some environment variable in case someone has issues.
on GCP it would still be fork?
when would it not be fork with TPUs?
Fork is an issue with pytorch/CUDA mostly.
But for safety, I would just add a Kaggle check as well in your code, and leave spawn as default.
Fork also helps on Colab and Kaggle because they are low-memory VMs: you can reduce memory consumption by creating the model (on the default pytorch/CPU device) at global scope, and then calling to(xla_device) from within the xmp.spawn() target functions.
This avoids creating a pytorch/CPU copy of the model in each of the processes (one per core).
You can see a few tricks to fit models on Colab here:
https://colab.research.google.com/drive/1IvCxIg-Q_DlI7UNJuajpl4UZXNiW5jMg
Like creating the model at global scope, and serializing the to(xla_device) calls to avoid all 8 processes rushing to allocate host memory at the same time.
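A rough sketch of that pattern (the model here is just a stand-in, and the sleep-based staggering is only one crude way to serialize the to(device) calls):

```python
import time

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Build the model once at global scope, on CPU. With start_method='fork' the
# child processes share this memory copy-on-write instead of each constructing
# its own pytorch/CPU copy of the model.
model = torch.nn.Linear(28 * 28, 10)  # stand-in for the real model


def _mp_fn(index):
    device = xm.xla_device()
    # Stagger the to(device) calls so all 8 processes do not rush to allocate
    # host memory at the same time (crude; a lock would also work).
    time.sleep(xm.get_ordinal() * 2)
    model.to(device)
    # ... build the dataloader and run the training loop here ...


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')
```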
I also have this issue. If I use a GPU the model trains normally, but when I try a TPU, this happens.
EDIT: Having analyzed it, the issue is about running out of RAM.
I believe this has to do with XLA using up RAM. I constantly use up all my RAM, which causes the SIGKILL error. See https://github.com/pytorch/xla/issues/1280, which references the Kaggle discussions.
```
training on 8 TPU cores
INIT TPU local core: 0, global rank: 0
INIT TPU local core: 4, global rank: 4
INIT TPU local core: 6, global rank: 6
INIT TPU local core: 3, global rank: 3
INIT TPU local core: 7, global rank: 7
INIT TPU local core: 5, global rank: 5
INIT TPU local core: 2, global rank: 2
INIT TPU local core: 1, global rank: 1
Validation sanity check:
0/? [00:00, ?it/s]
Exception in device=TPU:6: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
False)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
return self.next()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
xm.mark_step()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Exception in device=TPU:1: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
False)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
return self.next()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
xm.mark_step()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
Exception Traceback (most recent call last)
1 model = hatefull_memesCL()
2 if __name__ == '__main__':
----> 3 trainer.fit(model)
3 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
111 raise Exception(
112 "process %d terminated with exit code %d" %
--> 113 (error_index, exitcode)
114 )
115
Exception: process 6 terminated with exit code 17
```
Hmm, this is something different:
Invalid argument: Computation requires more parameters (732) than supported (limit 236).
We have seen that a few times, but I keep forgetting what the root cause was.
It's a misconfiguration of the TPU service, but I do not remember how it can get into that state.
@dlibenzi it is an interesting issue, I will let you know if I find the bug.