I wrap BertModel in a persistent object, initialize it once, and then call it repeatedly as a feature extractor to generate features for each data batch, but it seems I have hit a GPU memory leak: after the program starts, GPU memory usage keeps increasing until it goes out of memory. The key code is below. Every time self.bert_model.get_bert_feature() executes, GPU memory usage increases. From some simple debugging, the problem may be caused by BertEmbeddings.forward(). My PyTorch version is 0.4.0, with Python 3. Waiting for your reply, thanks very much!
class BertModel(PreTrainedBertModel):
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False):
        #logger.info('bert forward')
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # This attention mask is simpler than the triangular masking of causal attention
        # used in OpenAI GPT; we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoded_layers = self.encoder(embedding_output,
                                      extended_attention_mask,
                                      output_all_encoded_layers=output_all_encoded_layers)
        return encoded_layers
class Bert_Instance(object):
    def __init__(self, vocab_file, bert_model_path, device):
        #tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
        self.tokenizer = BertTokenizer(vocab_file)
        self.model = BertModel.from_pretrained(bert_model_path)
        self.device = device
        print('bert_device=', self.device)
        self.model.to(self.device)
        self.model.eval()
        # freeze all parameters; no gradients should be computed for BERT
        for para in self.model.parameters():
            para.requires_grad = False

    def get_feature(self, text_list, max_seq_length=50, layer=-1):
        '''
        Args:
            text_list: a list of sentences, of length sentence_number
        Return:
            (batch_size, seq_len+2, hidden_size)
        '''
        # a list of dicts with keys (ex_index, tokens, input_ids, input_mask, input_type_ids)
        all_features = convert_examples_to_features(examples=text_list,
                                                    max_seq_length=max_seq_length,
                                                    tokenizer=self.tokenizer)
        all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
        return all_encoder_layers, all_input_mask
class Bert_Model(object):
    def __init__(self, device):
        self.bert_model = Bert_Instance(BERT_VOCAB, BERT_MODEL, device)
        self.device = device
        self.zp_pre_cache = {}
        self.zp_post_cache = {}
        self.candi_np = {}
        self.cache = {'zp_pre': self.zp_pre_cache,
                      'zp_post': self.zp_post_cache,
                      'candi_np': self.candi_np}

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name]:
            #res = torch.tensor(self.cache[cache_name][batch_id]).type(torch.cuda.FloatTensor).to(self.device)
            res = self.cache[cache_name][batch_id]
            return res
        else:
            # cache the returned GPU tensors per batch_id for later reuse
            res = self.bert_model.get_feature(text_list, max_seq_length, layer)
            self.cache[cache_name][batch_id] = res
            return res
class Experiment(object):
    def __init__(self):
        # load training data
        with open(DIR + "data/train_data", "rb") as fin1, \
             open(DIR + "data/emb", "rb") as fin2:
            self.train_generator = cPickle.load(fin1)
            self.embedding_matrix, _, _ = cPickle.load(fin2, encoding='iso-8859-1')
        # load test data
        self.test_generator = DataGenerator("test", 256)
        self.dev_data = self.train_generator.generate_dev_data()
        self.test_data = self.test_generator.generate_data()
        # declare model architecture
        self.model = Network(nnargs["embedding_size"], nnargs["embedding_dimension"], self.embedding_matrix, nnargs["hidden_dimension"], 2).to(NET_DEVICE)
        self.bert_model = Bert_Model(BERT_DEVICE)
        this_lr = 0.003
        self.optimizer = optim.Adagrad(self.model.parameters(), lr=this_lr)
        self.best = {"sum": 0.0, "test_f": 0.0, "best_test_f": 0.0}
        self.dropout = nnargs["dropout"]

    def forward_step(self, data, mode, dropout=0.0):
        zp_relative_index, zp_pre, zp_pre_mask, zp_post, zp_post_mask, candi_np, candi_np_mask, feature, zp_pre_words, zp_post_words, candi_np_words, batch_id = data2tensor(data)
        batch_id = mode + '_' + str(batch_id)
        zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
        zp_post_bert, _ = self.bert_model.get_bert_feature(zp_post_words, 'zp_post', batch_id)
        candi_np_bert, _ = self.bert_model.get_bert_feature(candi_np_words, 'candi_np', batch_id)
.....
Maybe use the torch.no_grad() context manager, which is now the recommended way to perform inference with PyTorch?
See https://pytorch.org/docs/stable/autograd.html#torch.autograd.no_grad
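For reference, a rough sketch of what that could look like for the get_feature method above (an illustration against the code in this thread, not a tested patch):

def get_feature(self, text_list, max_seq_length=50, layer=-1):
    # identical to the version above, except the forward pass runs under
    # torch.no_grad(), so autograd builds no graph and keeps no activations alive
    all_features = convert_examples_to_features(examples=text_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=self.tokenizer)
    all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
    all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
    with torch.no_grad():
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
    return all_encoder_layers, all_input_mask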
Closing this. Feel free to re-open if the issue is still there.
Hey there, I also have a memory-leak problem when using BertModel to produce embeddings to be used as features later on.
I basically use the implementation from the usage example.
self.tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
self.model = trafo.BertModel.from_pretrained('bert-base-multilingual-cased')
self.model.eval()
...

def encode_text(self, text: str) -> np.ndarray:
    to_tokenize = f"[CLS] {text} [SEP]"
    tokenized_text = self.tokenizer.tokenize(to_tokenize)
    tokenized_text = tokenized_text[0:500]
    # Convert tokens to vocabulary indices
    indexed_tokens = self.tokenizer.convert_tokens_to_ids(tokenized_text)
    with torch.no_grad():
        tokens_tensor = torch.tensor([indexed_tokens]).data
        outputs = self.model(tokens_tensor)
    return outputs
I realized that if I comment out the line outputs = self.model(tokens_tensor) and just return some random numpy array instead, the memory does not increase. So it seems to be the call to the model on the tensor that increases the memory.
Further, if I use the 'bert-base-uncased' model, the memory stays the same as well. It only happens with the multilingual models.
I used this method in a Flask server application and made REST requests to it.
It's useful that you assert it occurs _only_ with the _multilingual_ BERT model. Can you try bert-base-multilingual-uncased in order to compare the two? Perhaps there is a _performance bug_ in the multilingual setting.
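For what it's worth, a quick comparison between the two checkpoints could look roughly like this (the measurement loop and the use of psutil are my own illustration, not code from this thread):

import psutil
import torch
from transformers import BertModel, BertTokenizer

def measure(model_name, n_requests=1000, text="some example sentence"):
    # load the checkpoint once, then watch the process RSS while issuing
    # repeated forward passes, mimicking the REST-request pattern above
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()
    process = psutil.Process()
    for i in range(n_requests):
        ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(f"[CLS] {text} [SEP]"))
        with torch.no_grad():
            model(torch.tensor([ids]))
        if i % 100 == 0:
            print(model_name, i, process.memory_info().rss // 2**20, "MB")

measure('bert-base-multilingual-cased')
measure('bert-base-multilingual-uncased')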
So I tried it with bert-base-multilingual-uncased as well, and it shows the same behavior.
I do not understand why memory constantly grows during inference. To my understanding, I only push data through the network and then use the output of the result layer. Before using Transformers, I used custom word embeddings trained in my own Keras models and did not see this behavior. What am I missing here?
I've just seen that you're using PyTorch 0.4.0! What an old version you're using :D Can you try installing the latest version of PyTorch (1.3.1) through pip install --upgrade torch and give us feedback? And please, if you can, also update Transformers to the latest version (2.2.2) through pip install --upgrade transformers.
Hey there, I'm using the newest PyTorch and Transformers. You are probably confusing me with the first comment of this thread (by zhangjcqq), but that was not mine. I just hijacked this thread because it seemed to be the same problem I now have, and there was no solution here.
So you have tried upgrading PyTorch to 1.3.1 as suggested in my last comment, and the error is still there? If not, please specify your environment and a piece of code so we can reproduce the bug.
I have the newest version of pytorch and transformers, yes.
I have been monitoring the memory usage over 24h while making ~300,000 requests. It seems that the memory increases constantly for quite some time but then stabilizes at a certain maximum: the application started at ~2.5 GB RAM and now stays at ~4.3 GB.
Maybe it has something to do with the varying lengths of the texts I process: the longest texts are processed at a later point in time and require the most RAM, and after that no subsequent text can need more, so usage stabilizes. Though this is just a thought.
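One way to test that hypothesis would be to push a single maximum-length input through the model right after start-up: if RAM jumps straight to the eventual plateau, the growth is allocator sizing for the longest sequence rather than a leak. A rough sketch, assuming the encode_text method above and its 500-token cap (the warm_up helper is hypothetical):

import torch

def warm_up(self):
    # hypothetical helper: run one maximum-length dummy input once, so any
    # length-dependent allocations happen up front instead of over time
    dummy_tokens = ['[CLS]'] + ['the'] * 498 + ['[SEP]']
    dummy_ids = self.tokenizer.convert_tokens_to_ids(dummy_tokens)
    with torch.no_grad():
        self.model(torch.tensor([dummy_ids]))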
Thanks already for your help, I'm off to Christmas vacations for now and will have a look at the issue in January again. I'll see if memory usage increases by then.
I ran into the same problem with Flask, but without Flask it works.
I have similar problems too. The memory usage gradually grows from 1xxx MB to 3xxx MB. @RomanTeucher @zhangjcqq did you manage to solve the issue?
@amjltc295 Did you find any solution to the above issue?
When I run Flask with threaded=False, it works.
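For anyone else hitting this, a minimal sketch of that change (the Encoder class and the /encode route are placeholders standing in for the wrapper shown earlier in this thread):

from flask import Flask, request, jsonify

app = Flask(__name__)
encoder = Encoder()  # placeholder for the object holding the tokenizer and model

@app.route('/encode', methods=['POST'])
def encode():
    outputs = encoder.encode_text(request.json['text'])
    # convert the tensor to plain Python data before serializing the response
    return jsonify(outputs[0].cpu().numpy().tolist())

if __name__ == '__main__':
    # serve requests in a single thread instead of one thread per request
    app.run(threaded=False)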
It seems that any Python process takes up more and more RAM over time. A co-worker of mine had issues as well, but with another Python project. We run our applications in Docker containers with limited RAM, so they all end up at 100% after some time.
Anyway, the application still works as it is supposed to, so we did not put further research into it.