I wrap BertModel in a persistent object, initialize it once, and then call it repeatedly as a feature extractor to generate features for each data batch, but it seems I have hit a GPU memory leak: after the program starts, GPU memory usage keeps increasing until it goes out of memory. The key code is below. Every time self.bert_model.get_bert_feature() executes, GPU memory usage increases. From some simple debugging, the problem may be caused by BertEmbeddings.forward(). My PyTorch version is 0.4.0, with Python 3. Waiting for your reply, thanks very much!
class BertModel(PreTrainedBertModel):
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False):
        #logger.info('bert forward')
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # This attention mask is simpler than the triangular masking of causal attention
        # used in OpenAI GPT; we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoded_layers = self.encoder(embedding_output,
                                      extended_attention_mask,
                                      output_all_encoded_layers=output_all_encoded_layers)
        return encoded_layers
class Bert_Instance(object):
    def __init__(self, vocab_file, bert_model_path, device):
        #tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
        self.tokenizer = BertTokenizer(vocab_file)
        self.model = BertModel.from_pretrained(bert_model_path)
        self.device = device
        print('bert_device=', self.device)
        self.model.to(self.device)
        self.model.eval()
        # freeze all parameters; no gradients should be computed for BERT
        for para in self.model.parameters():
            para.requires_grad = False

    def get_feature(self, text_list, max_seq_length=50, layer=-1):
        '''
        Args:
            text_list: a list of sentences, of length sentence_number
        Return:
            (batch_size, seq_len+2, hidden_size)
        '''
        # a list of dicts with keys (ex_index, tokens, input_ids, input_mask, input_type_ids)
        all_features = convert_examples_to_features(examples=text_list,
                                                    max_seq_length=max_seq_length,
                                                    tokenizer=self.tokenizer)
        all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
        return all_encoder_layers, all_input_mask
class Bert_Model(object):
    def __init__(self, device):
        self.bert_model = Bert_Instance(BERT_VOCAB, BERT_MODEL, device)
        self.device = device
        self.zp_pre_cache = {}
        self.zp_post_cache = {}
        self.candi_np = {}
        self.cache = {'zp_pre': self.zp_pre_cache,
                      'zp_post': self.zp_post_cache,
                      'candi_np': self.candi_np}

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name]:
            #res = torch.tensor(self.cache[cache_name][batch_id]).type(torch.cuda.FloatTensor).to(self.device)
            res = self.cache[cache_name][batch_id]
            return res
        else:
            # cache the returned GPU tensors per batch_id for later reuse
            res = self.bert_model.get_feature(text_list, max_seq_length, layer)
            self.cache[cache_name][batch_id] = res
            return res
class Experiment(object):
    def __init__(self):
        # load training data
        with open(DIR + "data/train_data", "rb") as fin1, \
             open(DIR + "data/emb", "rb") as fin2:
            self.train_generator = cPickle.load(fin1)
            self.embedding_matrix, _, _ = cPickle.load(fin2, encoding='iso-8859-1')
        # load test data
        self.test_generator = DataGenerator("test", 256)
        self.dev_data = self.train_generator.generate_dev_data()
        self.test_data = self.test_generator.generate_data()
        # declare model architecture
        self.model = Network(nnargs["embedding_size"], nnargs["embedding_dimension"], self.embedding_matrix, nnargs["hidden_dimension"], 2).to(NET_DEVICE)
        self.bert_model = Bert_Model(BERT_DEVICE)
        this_lr = 0.003
        self.optimizer = optim.Adagrad(self.model.parameters(), lr=this_lr)
        self.best = {"sum": 0.0, "test_f": 0.0, "best_test_f": 0.0}
        self.dropout = nnargs["dropout"]

    def forward_step(self, data, mode, dropout=0.0):
        zp_relative_index, zp_pre, zp_pre_mask, zp_post, zp_post_mask, candi_np, candi_np_mask, feature, zp_pre_words, zp_post_words, candi_np_words, batch_id = data2tensor(data)
        batch_id = mode + '_' + str(batch_id)
        zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
        zp_post_bert, _ = self.bert_model.get_bert_feature(zp_post_words, 'zp_post', batch_id)
        candi_np_bert, _ = self.bert_model.get_bert_feature(candi_np_words, 'candi_np', batch_id)
.....
Maybe use the torch.no_grad() context manager, which is now the recommended way to perform inference with PyTorch?
See https://pytorch.org/docs/stable/autograd.html#torch.autograd.no_grad
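For reference, a rough sketch of what that could look like for the get_feature method above (an illustration against the code in this thread, not a tested patch):

def get_feature(self, text_list, max_seq_length=50, layer=-1):
    # identical to the version above, except the forward pass runs under
    # torch.no_grad(), so autograd builds no graph and keeps no activations alive
    all_features = convert_examples_to_features(examples=text_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=self.tokenizer)
    all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
    all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
    with torch.no_grad():
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
    return all_encoder_layers, all_input_mask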
Closing this. Feel free to re-open if the issue is still there.
Hey there, I also have a memory-leak problem when using BertModel to produce embeddings to be used as features later on.
I basically use the implementation from the usage example.
self.tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
self.model = trafo.BertModel.from_pretrained('bert-base-multilingual-cased')
self.model.eval()
...

def encode_text(self, text: str) -> np.ndarray:
    to_tokenize = f"[CLS] {text} [SEP]"
    tokenized_text = self.tokenizer.tokenize(to_tokenize)
    tokenized_text = tokenized_text[0:500]
    # Convert tokens to vocabulary indices
    indexed_tokens = self.tokenizer.convert_tokens_to_ids(tokenized_text)
    with torch.no_grad():
        tokens_tensor = torch.tensor([indexed_tokens]).data
        outputs = self.model(tokens_tensor)
    return outputs
I realized that if I comment out the line outputs = self.model(tokens_tensor) and just return some random numpy array instead, the memory does not increase. So it seems to be the call to the model on the tensor that increases the memory.
Further, if I use the 'bert-base-uncased' model, the memory stays the same as well. It only happens with the multilingual models.
I used this method in a Flask server application and made REST requests to it.
It's useful that you assert it occurs _only_ with the _multilingual_ BERT model. Can you try bert-base-multilingual-uncased in order to compare the two? Perhaps there is a _performance bug_ in the multilingual setting.
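For what it's worth, a quick comparison between the two checkpoints could look roughly like this (the measurement loop and the use of psutil are my own illustration, not code from this thread):

import psutil
import torch
from transformers import BertModel, BertTokenizer

def measure(model_name, n_requests=1000, text="some example sentence"):
    # load the checkpoint once, then watch the process RSS while issuing
    # repeated forward passes, mimicking the REST-request pattern above
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()
    process = psutil.Process()
    for i in range(n_requests):
        ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(f"[CLS] {text} [SEP]"))
        with torch.no_grad():
            model(torch.tensor([ids]))
        if i % 100 == 0:
            print(model_name, i, process.memory_info().rss // 2**20, "MB")

measure('bert-base-multilingual-cased')
measure('bert-base-multilingual-uncased')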
So I tried it with bert-base-multilingual-uncased as well, and it shows the same behavior.
I do not understand why memory constantly grows during inference. To my understanding, I only push data through the network and then use the output of the result layer. Before using Transformers, I used custom word embeddings trained in my own Keras models and did not see this behavior. What am I missing here?
I've just seen that you're using PyTorch 0.4.0! What an old version you're using :D Can you try installing the latest version of PyTorch (1.3.1) through pip install --upgrade torch and give us feedback? And please, if you can, also update Transformers to the latest version (2.2.2) through pip install --upgrade transformers.
Hey there, I'm using the newest PyTorch and Transformers. You are probably confusing me with the first comment of this thread (by zhangjcqq), but that was not mine. I just hijacked this thread because it seemed to be the same problem I now have, and there was no solution here.
So you have tried upgrading PyTorch to 1.3.1 as suggested in my last comment, and the error is still there? If not, please specify your environment and a piece of code so we can reproduce the bug.
I have the newest version of pytorch and transformers, yes.
I have been monitoring the memory usage over 24h while making ~300,000 requests. It seems that the memory increases constantly for quite some time but then stabilizes at a certain maximum: the application started at ~2.5 GB RAM and now stays at ~4.3 GB.
Maybe it has something to do with the varying lengths of the texts I process: the longest texts are processed at a later point in time and require the most RAM, and after that no subsequent text can need more, so usage stabilizes. Though this is just a thought.
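One way to test that hypothesis would be to push a single maximum-length input through the model right after start-up: if RAM jumps straight to the eventual plateau, the growth is allocator sizing for the longest sequence rather than a leak. A rough sketch, assuming the encode_text method above and its 500-token cap (the warm_up helper is hypothetical):

import torch

def warm_up(self):
    # hypothetical helper: run one maximum-length dummy input once, so any
    # length-dependent allocations happen up front instead of over time
    dummy_tokens = ['[CLS]'] + ['the'] * 498 + ['[SEP]']
    dummy_ids = self.tokenizer.convert_tokens_to_ids(dummy_tokens)
    with torch.no_grad():
        self.model(torch.tensor([dummy_ids]))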
Thanks already for your help, I'm off to Christmas vacations for now and will have a look at the issue in January again. I'll see if memory usage increases by then.
I ran into the same problem with Flask, but without Flask it works.
I have similar problems too. The memory usage gradually grows from 1xxx MB to 3xxx MB. @RomanTeucher @zhangjcqq did you manage to solve the issue?
@amjltc295 Did you find any solution to the above issue?
When I run Flask with threaded=False, it works.
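For anyone else hitting this, a minimal sketch of that change (the Encoder class and the /encode route are placeholders standing in for the wrapper shown earlier in this thread):

from flask import Flask, request, jsonify

app = Flask(__name__)
encoder = Encoder()  # placeholder for the object holding the tokenizer and model

@app.route('/encode', methods=['POST'])
def encode():
    outputs = encoder.encode_text(request.json['text'])
    # convert the tensor to plain Python data before serializing the response
    return jsonify(outputs[0].cpu().numpy().tolist())

if __name__ == '__main__':
    # serve requests in a single thread instead of one thread per request
    app.run(threaded=False)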
It seems that any Python process takes up more and more RAM over time. A co-worker of mine had issues as well, but with another Python project. We run our applications in Docker containers with limited RAM, so they all end up at 100% after some time.
Anyway, the application still works as it is supposed to, so we did not put further research into it.