Flair: RuntimeError: CUDA out of memory. Tried to allocate 352.00 MiB (GPU 0; 15.90 GiB total capacity; 14.28 GiB already allocated; 259.88 MiB free; 14.94 GiB reserved in total by PyTorch) in bert-base-multilingual-cased

Created on 11 Feb 2020 · 8 comments · Source: flairNLP/flair

I would appreciate your help. The error below is generated only when I add the BERT 'bert-base-multilingual-cased' embedding; when I remove it, my code works fine. Please assist.

Code

from typing import List

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import (BertEmbeddings, PooledFlairEmbeddings,
                              StackedEmbeddings, TokenEmbeddings,
                              WordEmbeddings)

columns = {0: 'text', 1: 'ner'}

data_folder = '/content/gdrive/My Drive/resources/tasks/conll_03'

corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')

tag_type = 'ner'

tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('ar'),
    BertEmbeddings('bert-base-multilingual-cased'),  # adding this line triggers the OOM
    PooledFlairEmbeddings('ar-forward', pooling='max'),
    PooledFlairEmbeddings('ar-backward', pooling='max'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('/content/gdrive/My Drive/resources/taggers/example-ner',
              train_with_dev=True,
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)

Environment (please complete the following information):

  • Colab GPU
  • pip install --upgrade flair

Error
2020-02-11 11:43:30,835 ----------------------------------------------------------------------------------------------------
2020-02-11 11:43:30,837 Corpus: "Corpus: 1328 train + 710 dev + 605 test sentences"
2020-02-11 11:43:30,839 ----------------------------------------------------------------------------------------------------
2020-02-11 11:43:30,841 Parameters:
2020-02-11 11:43:30,844 - learning_rate: "0.1"
2020-02-11 11:43:30,845 - mini_batch_size: "32"
2020-02-11 11:43:30,846 - patience: "3"
2020-02-11 11:43:30,848 - anneal_factor: "0.5"
2020-02-11 11:43:30,849 - max_epochs: "150"
2020-02-11 11:43:30,851 - shuffle: "True"
2020-02-11 11:43:30,852 - train_with_dev: "True"
2020-02-11 11:43:30,853 - batch_growth_annealing: "False"
2020-02-11 11:43:30,854 ----------------------------------------------------------------------------------------------------
2020-02-11 11:43:30,855 Model training base path: "/content/gdrive/My Drive/resources/taggers/example-ner"
2020-02-11 11:43:30,857 ----------------------------------------------------------------------------------------------------
2020-02-11 11:43:30,858 Device: cuda:0
2020-02-11 11:43:30,860 ----------------------------------------------------------------------------------------------------
2020-02-11 11:43:30,861 Embeddings storage mode: cpu
2020-02-11 11:43:31,008 ----------------------------------------------------------------------------------------------------
train mode resetting embeddings
train mode resetting embeddings

2020-02-11 11:43:39,175 epoch 1 - iter 6/64 - loss 28.77731991 - samples/sec: 23.53

RuntimeError Traceback (most recent call last)
in ()
68 mini_batch_size=32,
69 max_epochs=150,
---> 70 checkpoint=True)
71

14 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py in gelu(x)
133 Also see https://arxiv.org/abs/1606.08415
134 """
--> 135 return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
136
137

RuntimeError: CUDA out of memory. Tried to allocate 168.00 MiB (GPU 0; 15.90 GiB total capacity; 14.52 GiB already allocated; 91.88 MiB free; 15.11 GiB reserved in total by PyTorch)

After one change to the setup (reinstalling with --upgrade), I got the same error at a different line:

pip install --upgrade flair

Error after the --upgrade command
2020-02-11 11:59:12,305 Reading data from /content/gdrive/My Drive/resources/tasks/conll_03
2020-02-11 11:59:12,307 Train: /content/gdrive/My Drive/resources/tasks/conll_03/train.txt
2020-02-11 11:59:12,308 Dev: /content/gdrive/My Drive/resources/tasks/conll_03/dev.txt
2020-02-11 11:59:12,310 Test: /content/gdrive/My Drive/resources/tasks/conll_03/test.txt
/usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function

'See the migration notes for details: %s' % _MIGRATION_NOTES_URL

RuntimeError Traceback (most recent call last)
in ()
50 embeddings=embeddings,
51 tag_dictionary=tag_dictionary,
---> 52 tag_type=tag_type)
53 # initialize trainer
54 from flair.trainers import ModelTrainer

8 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in convert(t)
421
422 def convert(t):
--> 423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
425 return self._apply(convert)

RuntimeError: CUDA out of memory. Tried to allocate 352.00 MiB (GPU 0; 15.90 GiB total capacity; 14.28 GiB already allocated; 259.88 MiB free; 14.94 GiB reserved in total by PyTorch)

Labels: bug, wontfix

All 8 comments

Hi @alanakbik, any update on this? The above problem is reproduced in many cases.

What is the problem? You lack memory on the GPU. You can lower the batch size (or chunk it).
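For example, a minimal sketch of that suggestion, reusing the training call from the report above (the mini_batch_chunk_size comment is an assumption: that argument exists only in newer Flair releases, so check your trainer.train signature before relying on it):

# Lower the mini-batch size until the OOM disappears; each step then
# embeds fewer sentences on the GPU at once.
# Newer Flair versions also accept mini_batch_chunk_size=... here to split
# each mini-batch into smaller forward passes at the same effective size.
trainer.train('/content/gdrive/My Drive/resources/taggers/example-ner',
              train_with_dev=True,
              learning_rate=0.1,
              mini_batch_size=8,   # was 32
              max_epochs=150,
              checkpoint=True)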

Hi @djstrong, thank you for your reply.
I tried using mini_batch_size=1 as below; however, I am getting the same error.
The problem disappears when using another, smaller dataset.

Any other suggestions?

trainer.train('/content/gdrive/My Drive/resources/taggers/example-ner',
              train_with_dev=True,
              learning_rate=0.1,
              mini_batch_size=1,
              max_epochs=150,
              checkpoint=True)

Maybe some sentences in the dataset are very long, causing the OOM. The number of examples should not matter, because embeddings are not stored on the GPU. You can check this as in the sketch below.
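A quick way to test that hypothesis, assuming the corpus object from the report above:

# Flair Sentence objects report their token count via len(), so the
# longest sentences in the training split can be inspected directly.
lengths = sorted(len(sentence) for sentence in corpus.train)
print('longest train sentence (tokens):', lengths[-1])
print('ten longest:', lengths[-10:])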

@djstrong Thank you.
I had a problem in my code: the whole corpus was read as one sentence. After adding document_separator_token='.' to the corpus, the code works fine (but this was for the same error generated from different code than reported here; see the sketch below).
Much appreciated @djstrong :)
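For reference, a sketch of that fix, assuming the same column format and paths as the original report:

from flair.data import Corpus
from flair.datasets import ColumnCorpus

columns = {0: 'text', 1: 'ner'}

# document_separator_token tells ColumnCorpus where one unit ends, so the
# whole file is no longer read as a single enormous sentence.
corpus: Corpus = ColumnCorpus('/content/gdrive/My Drive/resources/tasks/conll_03',
                              columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt',
                              document_separator_token='.')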

But the above solution works for the problem generated by a very big corpus (not for 'bert-base-multilingual-cased' in my code). I have not rechecked the problem with BERT 'bert-base-multilingual-cased' yet, and I guess the above solution will not work for it. I will check and report back.

I am also facing the same issue. The corpus size is:
train: 46214 sentences
validation: 9698 sentences
test: 9619 sentences

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, TokenEmbeddings
from typing import List

columns = {0: 'text', 1: 'ner'}

data_folder = "Flair/ner_corpus_iob/"

corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train/train.txt',
                              test_file='test.txt',
                              dev_file='valid.txt')

tag_type = 'ner'

tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

embedding_types: List[TokenEmbeddings] = [
    FlairEmbeddings('Flair/language_model_forward/best-lm.pt'),
    FlairEmbeddings('Flair/language_model_backward/best-lm.pt'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('Flair/ner_model',
              learning_rate=0.001,
              mini_batch_size=1,
              max_epochs=10,
              monitor_train=True,
              monitor_test=True,
              checkpoint=True,
              train_with_dev=True)

By setting mini_batch_size=1 I am able to start the training, but the model trains very slowly and the results are not good: I was only able to get an F1 score of around 45.

Training on AWS, ml.p3.2xlarge
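One way to look for a workable middle ground between mini_batch_size=1 and an OOM is to watch PyTorch's view of GPU memory while raising the batch size. A sketch follows; the helper name is hypothetical, but torch.cuda.memory_allocated and torch.cuda.memory_reserved are standard PyTorch calls:

import torch

def report_gpu_memory() -> None:
    # Hypothetical helper: compare what PyTorch has allocated for live
    # tensors with what its caching allocator has reserved from the driver,
    # the same two numbers that appear in the OOM messages above.
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f'allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB')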

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
