Training gpt2-large on a Colab TPU doesn't work
See the colab notebook: https://colab.research.google.com/drive/1An6D3wh_H4dbmlEUHYOXZYxkH6S7VKu9
This is the relevant part of the stack trace:
INFO:root:training on 8 TPU cores
2020-03-02 00:43:14.794597: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.857680: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.918609: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.974498: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.031540: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.087601: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.142553: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
E0302 00:43:22.445484458 1536 server_chttp2.cc:40] {"created":"@1583109802.445465277","description":"Only 1 addresses added out of total 2 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":404,"referenced_errors":[{"created":"@1583109802.445463004","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::1]:57271"}]}
2020-03-02 00:43:24.109498: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.429623: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.712988: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.731491: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.867584: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.883436: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:25.112841: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
INFO:root:INIT TPU local core: 0, global rank: 0
2020-03-02 00:44:11.382078: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
INFO:root:INIT TPU local core: 2, global rank: 2
INFO:root:INIT TPU local core: 5, global rank: 5
2020-03-02 00:44:15.925331: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'pl.Trainer.run_pretrain_routine': Socket closed (14)
Traceback (most recent call last):
File "finetune.py", line 129, in <module>
trainer.fit(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 976, in fit
xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
start_method=start_method)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
(error_index, name)
Exception: process 1 terminated with signal SIGKILL
The code works when training gpt2 (124M) but fails when training gpt2-large (774M).
Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0
Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
Versions of relevant libraries:
[pip3] numpy==1.17.5
[pip3] torch==1.4.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.5.0
[conda] Could not collect
Hi! Thanks for your contribution, great first issue!
@bkkaggle try again using the latest version.
I updated the Colab notebook. The error remains, but it looks like it's because pytorch/xla loads the data into all the processes, causing an OOM. (https://github.com/pytorch/xla/issues/1280#issuecomment-548607522)
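To make that concrete, here is a minimal sketch of the spawn pattern pytorch_lightning uses under the hood (this is not the notebook's actual code; `build_model` is a placeholder stand-in for loading gpt2-large):

```python
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def build_model():
    # Placeholder stand-in for loading gpt2-large: the real model has
    # ~774M float32 parameters, i.e. roughly 3 GB of host RAM per copy.
    return nn.Linear(1024, 1024)


def _mp_fn(index):
    # Each of the 8 spawned processes runs this function independently,
    # so the model (and any dataset it builds) is duplicated 8 times in
    # the Colab VM's host memory before anything reaches the TPU.
    device = xm.xla_device()
    model = build_model().to(device)
    xm.rendezvous("init")  # all cores sync here, as in the logs above


if __name__ == "__main__":
    # nprocs=8 -> 8 separate Python processes, each with its own copy.
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```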
Closing
@dlibenzi fyi.
@bkkaggle maybe file a bug in the xla repo?
It's likely the kernel OOM killer triggering this.
Colab VMs have limited memory and cores, so they cannot run very large workloads.
We will be changing the Cloud TPU architecture in the coming months; after that, the Colab VM should have much more memory and cores.
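If you want to confirm that from inside the notebook, a quick sketch like this (assuming dmesg is readable on the Colab VM, which it normally is) should surface the kill message:

```python
import subprocess

# If the SIGKILL came from the kernel OOM killer, the kernel log usually
# contains a line like "Out of memory: Kill process <pid> (python3) ...".
log = subprocess.check_output(["dmesg"], universal_newlines=True)
for line in log.splitlines():
    if "out of memory" in line.lower() or "oom" in line.lower():
        print(line)
```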
@srush fyi
Yup, this is what I saw as well. You need enough RAM to have the model loaded 8 times.
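Back-of-envelope numbers (mine, not measured): gpt2-large is ~774M float32 parameters, so roughly 3 GB per copy and ~25 GB across 8 processes, while a standard Colab VM has on the order of 13 GB of RAM; gpt2 at ~124M parameters (~0.5 GB per copy, ~4 GB total) fits, which matches the observed behavior.

```python
# Rough host-RAM estimate for loading the model once per TPU core.
# Parameter counts are the published model sizes; ~13 GB of Colab RAM is
# an assumption about the standard free-tier VM.
BYTES_PER_PARAM = 4  # float32 weights
NUM_PROCESSES = 8    # one per TPU core with xmp.spawn

for name, n_params in [("gpt2", 124e6), ("gpt2-large", 774e6)]:
    per_copy_gb = n_params * BYTES_PER_PARAM / 1e9
    total_gb = per_copy_gb * NUM_PROCESSES
    print(f"{name}: {per_copy_gb:.1f} GB per copy, "
          f"~{total_gb:.1f} GB across {NUM_PROCESSES} processes")

# gpt2:       ~0.5 GB per copy, ~4.0 GB total  -> fits in ~13 GB of RAM
# gpt2-large: ~3.1 GB per copy, ~24.8 GB total -> exceeds it; OOM killer fires
```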