Because the pretrained embeddings are very large, we need to run inference on the BERT model during training.
Let's say I have a batch of data in the following format:
(B, T, I) = (sentence no, words, word_index)
(B, L) = (sentence no, sentence length)
Is there any way to run inference on the pretrained BERT model and get an output like the one below?
(B, T, I, E) = (sentence no, words, word_index, embeddings)
All I can find is this: https://github.com/google-research/bert/blob/master/run_pretraining.py#L131
But I am not sure what these two arguments are:
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings
What do they do, and how do I get the desired output?
Do you mean the true "word" embedding?
BERT uses a WordPiece tokenizer (a BPE-like scheme) that splits an OOV word into a series of sub-word tokens, so you need to prepare the input accordingly.
If you have
(batch size, max sentence size)
then, after converting it, you will get:
(batch size, bert max sentence size)
Feed that to the BERT model,
and you will get:
(batch size, bert max sentence size, bert model size)
As you can see, the true word embedding is spread across a number of BERT token embeddings.
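To make the shape change concrete, here is a minimal sketch of the conversion from words to BERT tokens that also records where each word starts. `toy_tokenize` and `TOY_PIECES` are hypothetical stand-ins for the tokenizer shipped with BERT, and the split of "Johanson" is made up for illustration.

```python
# Toy sub-word table; a real run would use the tokenizer shipped with BERT.
TOY_PIECES = {
    "johanson": ["johan", "##son"],
}


def toy_tokenize(word):
    """Hypothetical stand-in for a sub-word tokenizer."""
    return TOY_PIECES.get(word.lower(), [word.lower()])


def to_bert_tokens(words):
    """Return (bert_tokens, first_piece_index) for one sentence."""
    bert_tokens = ["[CLS]"]
    first_piece_index = []
    for word in words:
        # Remember the position of the first piece of this word.
        first_piece_index.append(len(bert_tokens))
        bert_tokens.extend(toy_tokenize(word))
    bert_tokens.append("[SEP]")
    return bert_tokens, first_piece_index


words = ["John", "Johanson", "lives", "here"]
bert_tokens, first_piece_index = to_bert_tokens(words)
print(bert_tokens)        # 4 words become 7 BERT tokens
print(first_piece_index)  # one index into bert_tokens per original word
```

The list of first-piece indices is what lets you map the (batch size, bert max sentence size, bert model size) output back to the original words.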
In an NER task, we can treat the first BERT token embedding of a word as its word embedding.
Or
we could merge the BERT token embeddings of a word into a single word embedding,
but that is hard to compute with TensorFlow ops.
(Maybe it is possible in PyTorch?)
Or
we could train a BERT model from scratch on an untokenized (whole-word) version of the data.
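The first two options can be sketched outside the graph in plain Python. This is only an illustration under my own assumptions: `pool_word_vectors` and `word_spans` are hypothetical names, and the vectors are tiny made-up numbers rather than real BERT outputs.

```python
def pool_word_vectors(token_vectors, word_spans, mode="first"):
    """Collapse sub-word vectors into one vector per word.

    token_vectors: list of T' vectors (lists of floats) for one sentence.
    word_spans: list of (start, end) index pairs, end exclusive, one per word.
    mode: "first" keeps the first sub-word vector, "mean" averages the span.
    """
    word_vectors = []
    for start, end in word_spans:
        span = token_vectors[start:end]
        if mode == "first":
            word_vectors.append(list(span[0]))
        elif mode == "mean":
            dim = len(span[0])
            word_vectors.append(
                [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
            )
        else:
            raise ValueError("mode must be 'first' or 'mean'")
    return word_vectors


token_vectors = [
    [0.0, 0.0],  # [CLS]
    [1.0, 2.0],  # john
    [2.0, 4.0],  # johan
    [4.0, 8.0],  # ##son
    [0.0, 0.0],  # [SEP]
]
word_spans = [(1, 2), (2, 4)]  # "John" -> 1 piece, "Johanson" -> 2 pieces

first_vectors = pool_word_vectors(token_vectors, word_spans, "first")
mean_vectors = pool_word_vectors(token_vectors, word_spans, "mean")
print(first_vectors)
print(mean_vectors)
```

Inside a graph, the "first" variant amounts to a gather over the first-piece indices, which is the simpler of the two to express.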
Finally, if we want to use some kind of word-based feature (say, a word embedding from a character convolution, or a part-of-speech embedding),
we need to extend those feature vectors to align with the BERT vectors, like:
https://github.com/dsindex/etagger/blob/master/input.py#L105
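As a rough idea of what that alignment involves (not necessarily what the linked code does), one option is to repeat each word-level feature once per sub-word piece. `extend_to_pieces`, the `"PAD"` filler for the special tokens, and the piece counts below are all assumptions for illustration.

```python
def extend_to_pieces(word_features, pieces_per_word, pad_feature):
    """Repeat each word-level feature once per sub-word piece.

    pad_feature fills the [CLS] and [SEP] positions.
    """
    extended = [pad_feature]  # [CLS]
    for feature, n_pieces in zip(word_features, pieces_per_word):
        extended.extend([feature] * n_pieces)
    extended.append(pad_feature)  # [SEP]
    return extended


pos_tags = ["NNP", "NNP", "VBZ", "RB"]  # one tag per original word
pieces_per_word = [1, 2, 1, 1]          # the second word splits into 2 pieces
extended_tags = extend_to_pieces(pos_tags, pieces_per_word, "PAD")
print(extended_tags)
```

After this step the feature sequence has the same length as the BERT token sequence, so the two can be concatenated position by position.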
I'm not sure I understood the question completely, but the only thing you need to get the BERT embeddings is the sequence of words in your sentence, which it seems you already have in T.
Just remember to use model.get_sequence_output() instead of model.get_pooled_output() after creating the model using modeling.BertModel. The former, as the name suggests, returns a vector for each token in the input sequence.
use_one_hot_embeddings selects how word embeddings are retrieved from the embedding table. If False, tf.gather is used; otherwise, a one-hot encoding followed by a simple tf.matmul produces the requested embeddings. It seems to me that setting use_one_hot_embeddings to False is usually preferred (unless, for some reason, you want your model to be less parallel).
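The two lookup paths give the same result and differ only in how they compute it. A small sketch in plain Python, with a made-up 3x2 embedding table and hypothetical helper names, shows the equivalence:

```python
def lookup_by_gather(table, ids):
    """Pick rows of the table directly by index (the gather path)."""
    return [list(table[i]) for i in ids]


def lookup_by_one_hot(table, ids):
    """Build a one-hot row per id and multiply it with the table (the matmul path)."""
    vocab_size = len(table)
    dim = len(table[0])
    rows = []
    for i in ids:
        one_hot = [1.0 if v == i else 0.0 for v in range(vocab_size)]
        rows.append(
            [sum(one_hot[v] * table[v][d] for v in range(vocab_size)) for d in range(dim)]
        )
    return rows


table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # (vocab size, embedding size)
ids = [2, 0, 2]

gathered = lookup_by_gather(table, ids)
multiplied = lookup_by_one_hot(table, ids)
print(gathered == multiplied)
```

The one-hot path does work proportional to the vocabulary size for every token, while the gather path only touches the rows it needs.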
token_type_ids, as I understand it, is a generalization of segment_ids. It's quite intuitive to add some information to the tokens of the second sequence (e.g. the answer) to make the model aware that they belong to the second sequence. So, for tasks like question answering, token_type_ids are 0 for tokens in the question and 1 for tokens in the answer. One can also imagine a task with more than two token types, which is where token_type_ids becomes quite handy.
@dsindex hi there!
Nice repo you have.
I saw you are working on NER. Awesome work. I worked on it once, too.
I still don't get it.
What is the segment id? What should be the shape of it?
Apart from that I got my answer.
@sbmaruf
segment_id is all 0's with length T' for NER.
If you are trying to solve a problem with two inputs A and B
(for example, MRC or sentence-pair classification),
then sentence A should have 0's and sentence B should have 1's as segment_id for the segment embedding.
The pretrained BERT includes a pretrained segment embedding, and the vocabulary size for the segment embedding is 2.
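For the shape question: segment ids have the same shape as the input ids, one id per BERT token. Below is a minimal sketch for a single sentence and for a sentence pair, following the 0-for-A, 1-for-B convention used in the BERT repository's example scripts. `build_segment_ids` is a hypothetical helper, not a function from the repo.

```python
def build_segment_ids(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) for one example.

    Sentence A, including [CLS] and its [SEP], gets 0.
    Sentence B, including its trailing [SEP], gets 1.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Single sentence, e.g. NER: every position is segment 0.
ner_tokens, ner_segments = build_segment_ids(["john", "lives", "here"])

# Sentence pair, e.g. question plus passage.
pair_tokens, pair_segments = build_segment_ids(["who", "lives"], ["john"])

print(ner_segments)
print(pair_segments)
```

Batched and padded, this becomes a (batch size, bert max sentence size) integer tensor that is passed as token_type_ids.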