I'm trying to train BART (with the transformers library) on a Colab TPU. I followed the PyTorch Lightning TPU documentation, but before training can start, I get the following error:
Exception: process 0 terminated with signal SIGKILL
I'm using the official text-summarization example from the transformers library: https://github.com/huggingface/transformers/blob/master/examples/summarization/bart/finetune.py
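For context, the launch boils down to roughly the following (a minimal sketch, not my exact script; `SummarizationTrainer` is the LightningModule defined in that example, and `num_tpu_cores` is the flag from the Lightning 0.7.3 TPU docs):

```python
import pytorch_lightning as pl

# Rough sketch of the TPU launch, following the Lightning TPU docs.
model = SummarizationTrainer(args)   # LightningModule from finetune.py (args from argparse)
trainer = pl.Trainer(
    num_tpu_cores=8,                 # spawn one process per Colab TPU core
    max_epochs=args.num_train_epochs,
)
trainer.fit(model)                   # crashes during the validation sanity check
```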
Here is the full stack trace:
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/config.json from cache at /root/.cache/torch/transformers/9236d197566a7f1be2b2151f5afcc5a8e17f31e1e23c52f3cdf2340019986e78.3de31bca490b759d81268bc95fdc9ab61f970ee46716ae8b25a1f4f1aba766e7
INFO:transformers.configuration_utils:Model config ElectraConfig {
"_num_labels": 2,
"architectures": [
"ElectraForPreTraining"
],
"attention_probs_dropout_prob": 0.1,
"bad_words_ids": null,
"bos_token_id": null,
"decoder_start_token_id": null,
"do_sample": false,
"early_stopping": false,
"embedding_size": 768,
"eos_token_id": null,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"min_length": 0,
"model_type": "electra",
"no_repeat_ngram_size": 0,
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"prefix": null,
"pruned_heads": {},
"repetition_penalty": 1.0,
"task_specific_params": null,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 30522
}
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/vocab.txt from cache at /root/.cache/torch/transformers/ff085885d4c95651587af553adadd34a26de8a663f2cef709635b48b3bed2bbd.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:transformers.modeling_utils:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/pytorch_model.bin from cache at /root/.cache/torch/transformers/3c8e97e5021532563898ceb491dbfbc068ab4cb9eaa31f555990b9993e3228b4.b7514d01ce5acfe02313470cce3175018852a5e8cbcb8784268ab87dc21daf4c
INFO:transformers.modeling_utils:Weights from pretrained model not used in ElectraModel: ['electra.embeddings_project.weight', 'electra.embeddings_project.bias']
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/warnings.py:18: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
warnings.warn(*args, **kwargs)
=> Dataset loaded from cache (./cnndm/ElectraTokenizer_cache_train.pt)
=> Dataset loaded from cache (./cnndm/ElectraTokenizer_cache_val.pt)
INFO:lightning:training on 8 TPU cores
=> Dataset loaded from cache (./cnndm/ElectraTokenizer_cache_test.pt)
INFO:lightning:INIT TPU local core: 0, global rank: 0
INFO:lightning:INIT TPU local core: 2, global rank: 2
INFO:lightning:INIT TPU local core: 3, global rank: 3
INFO:lightning:INIT TPU local core: 1, global rank: 1
INFO:lightning:INIT TPU local core: 4, global rank: 4
INFO:lightning:INIT TPU local core: 5, global rank: 5
INFO:lightning:INIT TPU local core: 6, global rank: 6
INFO:lightning:INIT TPU local core: 7, global rank: 7
INFO:lightning:
| Name | Type | Params
------------------------------------------------------------------------------------
0 | model | ElectraModel | 108 M
1 | model.embeddings | ElectraEmbeddings | 23 M
2 | model.embeddings.word_embeddings | Embedding | 23 M
3 | model.embeddings.position_embeddings | Embedding | 393 K
4 | model.embeddings.token_type_embeddings | Embedding | 1 K
5 | model.embeddings.LayerNorm | LayerNorm | 1 K
6 | model.embeddings.dropout | Dropout | 0
7 | model.encoder | BertEncoder | 85 M
8 | model.encoder.layer | ModuleList | 85 M
9 | model.encoder.layer.0 | BertLayer | 7 M
10 | model.encoder.layer.0.attention | BertAttention | 2 M
11 | model.encoder.layer.0.attention.self | BertSelfAttention | 1 M
12 | model.encoder.layer.0.attention.self.query | Linear | 590 K
13 | model.encoder.layer.0.attention.self.key | Linear | 590 K
14 | model.encoder.layer.0.attention.self.value | Linear | 590 K
15 | model.encoder.layer.0.attention.self.dropout | Dropout | 0
16 | model.encoder.layer.0.attention.output | BertSelfOutput | 592 K
17 | model.encoder.layer.0.attention.output.dense | Linear | 590 K
18 | model.encoder.layer.0.attention.output.LayerNorm | LayerNorm | 1 K
19 | model.encoder.layer.0.attention.output.dropout | Dropout | 0
20 | model.encoder.layer.0.intermediate | BertIntermediate | 2 M
21 | model.encoder.layer.0.intermediate.dense | Linear | 2 M
22 | model.encoder.layer.0.output | BertOutput | 2 M
23 | model.encoder.layer.0.output.dense | Linear | 2 M
24 | model.encoder.layer.0.output.LayerNorm | LayerNorm | 1 K
25 | model.encoder.layer.0.output.dropout | Dropout | 0
26 | model.encoder.layer.1 | BertLayer | 7 M
27 | model.encoder.layer.1.attention | BertAttention | 2 M
28 | model.encoder.layer.1.attention.self | BertSelfAttention | 1 M
29 | model.encoder.layer.1.attention.self.query | Linear | 590 K
30 | model.encoder.layer.1.attention.self.key | Linear | 590 K
31 | model.encoder.layer.1.attention.self.value | Linear | 590 K
32 | model.encoder.layer.1.attention.self.dropout | Dropout | 0
33 | model.encoder.layer.1.attention.output | BertSelfOutput | 592 K
34 | model.encoder.layer.1.attention.output.dense | Linear | 590 K
35 | model.encoder.layer.1.attention.output.LayerNorm | LayerNorm | 1 K
36 | model.encoder.layer.1.attention.output.dropout | Dropout | 0
37 | model.encoder.layer.1.intermediate | BertIntermediate | 2 M
38 | model.encoder.layer.1.intermediate.dense | Linear | 2 M
39 | model.encoder.layer.1.output | BertOutput | 2 M
40 | model.encoder.layer.1.output.dense | Linear | 2 M
41 | model.encoder.layer.1.output.LayerNorm | LayerNorm | 1 K
42 | model.encoder.layer.1.output.dropout | Dropout | 0
43 | model.encoder.layer.2 | BertLayer | 7 M
44 | model.encoder.layer.2.attention | BertAttention | 2 M
45 | model.encoder.layer.2.attention.self | BertSelfAttention | 1 M
46 | model.encoder.layer.2.attention.self.query | Linear | 590 K
47 | model.encoder.layer.2.attention.self.key | Linear | 590 K
48 | model.encoder.layer.2.attention.self.value | Linear | 590 K
49 | model.encoder.layer.2.attention.self.dropout | Dropout | 0
50 | model.encoder.layer.2.attention.output | BertSelfOutput | 592 K
51 | model.encoder.layer.2.attention.output.dense | Linear | 590 K
52 | model.encoder.layer.2.attention.output.LayerNorm | LayerNorm | 1 K
53 | model.encoder.layer.2.attention.output.dropout | Dropout | 0
54 | model.encoder.layer.2.intermediate | BertIntermediate | 2 M
55 | model.encoder.layer.2.intermediate.dense | Linear | 2 M
56 | model.encoder.layer.2.output | BertOutput | 2 M
57 | model.encoder.layer.2.output.dense | Linear | 2 M
58 | model.encoder.layer.2.output.LayerNorm | LayerNorm | 1 K
59 | model.encoder.layer.2.output.dropout | Dropout | 0
60 | model.encoder.layer.3 | BertLayer | 7 M
61 | model.encoder.layer.3.attention | BertAttention | 2 M
62 | model.encoder.layer.3.attention.self | BertSelfAttention | 1 M
63 | model.encoder.layer.3.attention.self.query | Linear | 590 K
64 | model.encoder.layer.3.attention.self.key | Linear | 590 K
65 | model.encoder.layer.3.attention.self.value | Linear | 590 K
66 | model.encoder.layer.3.attention.self.dropout | Dropout | 0
67 | model.encoder.layer.3.attention.output | BertSelfOutput | 592 K
68 | model.encoder.layer.3.attention.output.dense | Linear | 590 K
69 | model.encoder.layer.3.attention.output.LayerNorm | LayerNorm | 1 K
70 | model.encoder.layer.3.attention.output.dropout | Dropout | 0
71 | model.encoder.layer.3.intermediate | BertIntermediate | 2 M
72 | model.encoder.layer.3.intermediate.dense | Linear | 2 M
73 | model.encoder.layer.3.output | BertOutput | 2 M
74 | model.encoder.layer.3.output.dense | Linear | 2 M
75 | model.encoder.layer.3.output.LayerNorm | LayerNorm | 1 K
76 | model.encoder.layer.3.output.dropout | Dropout | 0
77 | model.encoder.layer.4 | BertLayer | 7 M
78 | model.encoder.layer.4.attention | BertAttention | 2 M
79 | model.encoder.layer.4.attention.self | BertSelfAttention | 1 M
80 | model.encoder.layer.4.attention.self.query | Linear | 590 K
81 | model.encoder.layer.4.attention.self.key | Linear | 590 K
82 | model.encoder.layer.4.attention.self.value | Linear | 590 K
83 | model.encoder.layer.4.attention.self.dropout | Dropout | 0
84 | model.encoder.layer.4.attention.output | BertSelfOutput | 592 K
85 | model.encoder.layer.4.attention.output.dense | Linear | 590 K
86 | model.encoder.layer.4.attention.output.LayerNorm | LayerNorm | 1 K
87 | model.encoder.layer.4.attention.output.dropout | Dropout | 0
88 | model.encoder.layer.4.intermediate | BertIntermediate | 2 M
89 | model.encoder.layer.4.intermediate.dense | Linear | 2 M
90 | model.encoder.layer.4.output | BertOutput | 2 M
91 | model.encoder.layer.4.output.dense | Linear | 2 M
92 | model.encoder.layer.4.output.LayerNorm | LayerNorm | 1 K
93 | model.encoder.layer.4.output.dropout | Dropout | 0
94 | model.encoder.layer.5 | BertLayer | 7 M
95 | model.encoder.layer.5.attention | BertAttention | 2 M
96 | model.encoder.layer.5.attention.self | BertSelfAttention | 1 M
97 | model.encoder.layer.5.attention.self.query | Linear | 590 K
98 | model.encoder.layer.5.attention.self.key | Linear | 590 K
99 | model.encoder.layer.5.attention.self.value | Linear | 590 K
100 | model.encoder.layer.5.attention.self.dropout | Dropout | 0
101 | model.encoder.layer.5.attention.output | BertSelfOutput | 592 K
102 | model.encoder.layer.5.attention.output.dense | Linear | 590 K
103 | model.encoder.layer.5.attention.output.LayerNorm | LayerNorm | 1 K
104 | model.encoder.layer.5.attention.output.dropout | Dropout | 0
105 | model.encoder.layer.5.intermediate | BertIntermediate | 2 M
106 | model.encoder.layer.5.intermediate.dense | Linear | 2 M
107 | model.encoder.layer.5.output | BertOutput | 2 M
108 | model.encoder.layer.5.output.dense | Linear | 2 M
109 | model.encoder.layer.5.output.LayerNorm | LayerNorm | 1 K
110 | model.encoder.layer.5.output.dropout | Dropout | 0
111 | model.encoder.layer.6 | BertLayer | 7 M
112 | model.encoder.layer.6.attention | BertAttention | 2 M
113 | model.encoder.layer.6.attention.self | BertSelfAttention | 1 M
114 | model.encoder.layer.6.attention.self.query | Linear | 590 K
115 | model.encoder.layer.6.attention.self.key | Linear | 590 K
116 | model.encoder.layer.6.attention.self.value | Linear | 590 K
117 | model.encoder.layer.6.attention.self.dropout | Dropout | 0
118 | model.encoder.layer.6.attention.output | BertSelfOutput | 592 K
119 | model.encoder.layer.6.attention.output.dense | Linear | 590 K
120 | model.encoder.layer.6.attention.output.LayerNorm | LayerNorm | 1 K
121 | model.encoder.layer.6.attention.output.dropout | Dropout | 0
122 | model.encoder.layer.6.intermediate | BertIntermediate | 2 M
123 | model.encoder.layer.6.intermediate.dense | Linear | 2 M
124 | model.encoder.layer.6.output | BertOutput | 2 M
125 | model.encoder.layer.6.output.dense | Linear | 2 M
126 | model.encoder.layer.6.output.LayerNorm | LayerNorm | 1 K
127 | model.encoder.layer.6.output.dropout | Dropout | 0
128 | model.encoder.layer.7 | BertLayer | 7 M
129 | model.encoder.layer.7.attention | BertAttention | 2 M
130 | model.encoder.layer.7.attention.self | BertSelfAttention | 1 M
131 | model.encoder.layer.7.attention.self.query | Linear | 590 K
132 | model.encoder.layer.7.attention.self.key | Linear | 590 K
133 | model.encoder.layer.7.attention.self.value | Linear | 590 K
134 | model.encoder.layer.7.attention.self.dropout | Dropout | 0
135 | model.encoder.layer.7.attention.output | BertSelfOutput | 592 K
136 | model.encoder.layer.7.attention.output.dense | Linear | 590 K
137 | model.encoder.layer.7.attention.output.LayerNorm | LayerNorm | 1 K
138 | model.encoder.layer.7.attention.output.dropout | Dropout | 0
139 | model.encoder.layer.7.intermediate | BertIntermediate | 2 M
140 | model.encoder.layer.7.intermediate.dense | Linear | 2 M
141 | model.encoder.layer.7.output | BertOutput | 2 M
142 | model.encoder.layer.7.output.dense | Linear | 2 M
143 | model.encoder.layer.7.output.LayerNorm | LayerNorm | 1 K
144 | model.encoder.layer.7.output.dropout | Dropout | 0
145 | model.encoder.layer.8 | BertLayer | 7 M
146 | model.encoder.layer.8.attention | BertAttention | 2 M
147 | model.encoder.layer.8.attention.self | BertSelfAttention | 1 M
148 | model.encoder.layer.8.attention.self.query | Linear | 590 K
149 | model.encoder.layer.8.attention.self.key | Linear | 590 K
150 | model.encoder.layer.8.attention.self.value | Linear | 590 K
151 | model.encoder.layer.8.attention.self.dropout | Dropout | 0
152 | model.encoder.layer.8.attention.output | BertSelfOutput | 592 K
153 | model.encoder.layer.8.attention.output.dense | Linear | 590 K
154 | model.encoder.layer.8.attention.output.LayerNorm | LayerNorm | 1 K
155 | model.encoder.layer.8.attention.output.dropout | Dropout | 0
156 | model.encoder.layer.8.intermediate | BertIntermediate | 2 M
157 | model.encoder.layer.8.intermediate.dense | Linear | 2 M
158 | model.encoder.layer.8.output | BertOutput | 2 M
159 | model.encoder.layer.8.output.dense | Linear | 2 M
160 | model.encoder.layer.8.output.LayerNorm | LayerNorm | 1 K
161 | model.encoder.layer.8.output.dropout | Dropout | 0
162 | model.encoder.layer.9 | BertLayer | 7 M
163 | model.encoder.layer.9.attention | BertAttention | 2 M
164 | model.encoder.layer.9.attention.self | BertSelfAttention | 1 M
165 | model.encoder.layer.9.attention.self.query | Linear | 590 K
166 | model.encoder.layer.9.attention.self.key | Linear | 590 K
167 | model.encoder.layer.9.attention.self.value | Linear | 590 K
168 | model.encoder.layer.9.attention.self.dropout | Dropout | 0
169 | model.encoder.layer.9.attention.output | BertSelfOutput | 592 K
170 | model.encoder.layer.9.attention.output.dense | Linear | 590 K
171 | model.encoder.layer.9.attention.output.LayerNorm | LayerNorm | 1 K
172 | model.encoder.layer.9.attention.output.dropout | Dropout | 0
173 | model.encoder.layer.9.intermediate | BertIntermediate | 2 M
174 | model.encoder.layer.9.intermediate.dense | Linear | 2 M
175 | model.encoder.layer.9.output | BertOutput | 2 M
176 | model.encoder.layer.9.output.dense | Linear | 2 M
177 | model.encoder.layer.9.output.LayerNorm | LayerNorm | 1 K
178 | model.encoder.layer.9.output.dropout | Dropout | 0
179 | model.encoder.layer.10 | BertLayer | 7 M
180 | model.encoder.layer.10.attention | BertAttention | 2 M
181 | model.encoder.layer.10.attention.self | BertSelfAttention | 1 M
182 | model.encoder.layer.10.attention.self.query | Linear | 590 K
183 | model.encoder.layer.10.attention.self.key | Linear | 590 K
184 | model.encoder.layer.10.attention.self.value | Linear | 590 K
185 | model.encoder.layer.10.attention.self.dropout | Dropout | 0
186 | model.encoder.layer.10.attention.output | BertSelfOutput | 592 K
187 | model.encoder.layer.10.attention.output.dense | Linear | 590 K
188 | model.encoder.layer.10.attention.output.LayerNorm | LayerNorm | 1 K
189 | model.encoder.layer.10.attention.output.dropout | Dropout | 0
190 | model.encoder.layer.10.intermediate | BertIntermediate | 2 M
191 | model.encoder.layer.10.intermediate.dense | Linear | 2 M
192 | model.encoder.layer.10.output | BertOutput | 2 M
193 | model.encoder.layer.10.output.dense | Linear | 2 M
194 | model.encoder.layer.10.output.LayerNorm | LayerNorm | 1 K
195 | model.encoder.layer.10.output.dropout | Dropout | 0
196 | model.encoder.layer.11 | BertLayer | 7 M
197 | model.encoder.layer.11.attention | BertAttention | 2 M
198 | model.encoder.layer.11.attention.self | BertSelfAttention | 1 M
199 | model.encoder.layer.11.attention.self.query | Linear | 590 K
200 | model.encoder.layer.11.attention.self.key | Linear | 590 K
201 | model.encoder.layer.11.attention.self.value | Linear | 590 K
202 | model.encoder.layer.11.attention.self.dropout | Dropout | 0
203 | model.encoder.layer.11.attention.output | BertSelfOutput | 592 K
204 | model.encoder.layer.11.attention.output.dense | Linear | 590 K
205 | model.encoder.layer.11.attention.output.LayerNorm | LayerNorm | 1 K
206 | model.encoder.layer.11.attention.output.dropout | Dropout | 0
207 | model.encoder.layer.11.intermediate | BertIntermediate | 2 M
208 | model.encoder.layer.11.intermediate.dense | Linear | 2 M
209 | model.encoder.layer.11.output | BertOutput | 2 M
210 | model.encoder.layer.11.output.dense | Linear | 2 M
211 | model.encoder.layer.11.output.LayerNorm | LayerNorm | 1 K
212 | model.encoder.layer.11.output.dropout | Dropout | 0
213 | dropout | Dropout | 0
214 | classifier | Linear | 769
Validation sanity check: 0%|          | 0/5 [00:00<?, ?it/s]
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-22-ab45ae238b6a> in <module>()
41 args = parser.parse_args(cmd_args)
42
---> 43 main(args)
5 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
106 raise Exception(
107 "process %d terminated with signal %s" %
--> 108 (error_index, name)
109 )
110 else:
Exception: process 0 terminated with signal SIGKILL
* CUDA:
- GPU:
- available: False
- version: None
* Packages:
- numpy: 1.18.2
- pyTorch_debug: False
- pyTorch_version: 1.6.0a0+b889e0d
- pytorch-lightning: 0.7.3
- tensorboard: 2.2.1
- tqdm: 4.38.0
* System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.6.9
- version: #1 SMP Wed Feb 19 05:26:34 PST 2020
I saw issue #996, but I don't think that's the cause here, because my RAM does not appear to be full:
(screenshot of the Colab resource monitor showing RAM usage well below the limit)
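For reference, this is how I double-check available host RAM from inside the notebook (a quick sketch using psutil, which comes preinstalled on Colab):

```python
import psutil

# Print total / available host RAM and the current usage percentage.
vm = psutil.virtual_memory()
print(f"total: {vm.total / 1e9:.1f} GB, "
      f"available: {vm.available / 1e9:.1f} GB, "
      f"used: {vm.percent}%")
```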
You need the memory crash note at the top of the Colab, no?
@srush @dlibenzi
That message is almost certainly the Linux kernel OOM killer getting rid of the fattiest process.
You can take a look at this thread, where I posted some notebooks that try to overcome the limited RAM in Colab/Kaggle notebooks:
https://github.com/pytorch/xla/issues/1870
Unfortunately, those don't match our usual way of doing things, nor what Lightning adopts at the moment.
I was able to solve my specific problem by setting num_workers=1 in my DataLoader (instead of 4).
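Concretely, the change was only in the dataloader definition; a minimal sketch (names like `self.train_dataset` and `self.hparams.train_batch_size` are placeholders for whatever the example script actually uses):

```python
from torch.utils.data import DataLoader

def train_dataloader(self):
    # With num_workers=4, each of the 8 TPU processes spawned 4 loader workers,
    # which exhausted Colab's host RAM and got the process SIGKILLed.
    return DataLoader(
        self.train_dataset,                         # placeholder dataset attribute
        batch_size=self.hparams.train_batch_size,   # placeholder hyperparameter
        shuffle=True,
        num_workers=1,                              # was 4
    )
```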
What is different between Colab and Kaggle? I had this problem only on Colab...