Bert: How to integrate bert into a different model?

Created on 10 Feb 2019  ·  4 Comments  ·  Source: google-research/bert

Because the pretrained embeddings are very large, we need to run inference on the BERT model at training time.
Let's say I have a batch of data in the following format:
(B, T, I) = (sentence no, words, word_index)
(B, L) = (sentence no, sentence length)
Is there any way to run inference on the pretrained BERT model and get an output like the one below?
(B, T, I, E) = (sentence no, words, word_index, embeddings)
All I can find is this: https://github.com/google-research/bert/blob/master/run_pretraining.py#L131
But I am not sure what these two arguments are,
token_type_ids=segment_ids, use_one_hot_embeddings=use_one_hot_embeddings
or how to get the desired output.

Most helpful comment

I'm not sure I understood the question completely, but the only thing you need to get the BERT embeddings is the sequence of words in each sentence, which it seems you already have in T.

Just remember to use model.get_sequence_output() instead of model.get_pooled_output() after creating the model using modeling.BertModel. The former, as the name suggests, returns a vector for each token in the input sequence.
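A minimal sketch of that call sequence, assuming TensorFlow 1.x and a checkout of google-research/bert on the Python path so that `modeling` is importable. The function name `build_sequence_output` and its parameters are illustrative and not part of the repository; the imports are deferred into the function so that defining it has no dependencies.

```python
def build_sequence_output(input_ids, input_mask, segment_ids,
                          bert_config_file, init_checkpoint=None):
    """Return per-token BERT vectors of shape (batch, seq_len, hidden_size).

    Assumes TensorFlow 1.x and modeling.py from google-research/bert.
    The three inputs are int32 tensors of shape (batch, seq_len).
    """
    import tensorflow as tf   # assumed: TensorFlow 1.x
    import modeling           # assumed: from the google-research/bert repo

    bert_config = modeling.BertConfig.from_json_file(bert_config_file)
    model = modeling.BertModel(
        config=bert_config,
        is_training=False,             # disables dropout inside BERT
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False)  # gather-based lookup (CPU/GPU)

    if init_checkpoint:
        # Initialize the freshly built variables from the pretrained checkpoint.
        tvars = tf.trainable_variables()
        assignment_map, _ = modeling.get_assignment_map_from_checkpoint(
            tvars, init_checkpoint)
        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    # One vector per token; get_pooled_output() would return one per sentence.
    return model.get_sequence_output()
```

The returned tensor can then be fed into the downstream model in place of an ordinary embedding lookup.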

use_one_hot_embeddings selects how word embeddings are retrieved from the embedding table. If False, the lookup uses tf.gather; otherwise the ids are one-hot encoded and a simple tf.matmul produces the requested embeddings. It seems to me that setting use_one_hot_embeddings to False is usually preferred on CPU or GPU. In the repository's own scripts the one-hot path is turned on for TPU runs, where it is reported to be faster.
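To see why the two lookup paths give the same result, here is a small pure-Python sketch. It does not use TensorFlow; `lookup_gather` and `lookup_one_hot` are hypothetical helpers that mimic a gather and a one-hot matrix multiply on a made-up embedding table.

```python
def lookup_gather(table, ids):
    """Pick rows directly by index (what a gather-style lookup does)."""
    return [list(table[i]) for i in ids]


def lookup_one_hot(table, ids):
    """One-hot encode each id, then multiply by the embedding table."""
    vocab_size, width = len(table), len(table[0])
    out = []
    for i in ids:
        one_hot = [1.0 if v == i else 0.0 for v in range(vocab_size)]
        row = [sum(one_hot[v] * table[v][d] for v in range(vocab_size))
               for d in range(width)]
        out.append(row)
    return out


# Toy embedding table: 4 vocabulary entries, width 3 (values are made up).
table = [[0.0, 0.1, 0.2],
         [1.0, 1.1, 1.2],
         [2.0, 2.1, 2.2],
         [3.0, 3.1, 3.2]]
ids = [2, 0, 3]

gathered = lookup_gather(table, ids)
multiplied = lookup_one_hot(table, ids)
print(gathered == multiplied)
```

Both paths select the same rows; the difference is only in how the hardware executes them.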

token_type_ids, as I understand it, is a generalization of segment_ids. The idea is to add some information to the words of the second sequence (e.g. the answer) so that the model is aware that they belong to the second sequence. So, for tasks like question answering, token_type_ids are 0 for words in the question and 1 for words in the answer. One can also imagine a task with more than two token types, which is where token_type_ids come in handy.
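As an illustration of the 0/1 assignment described above, the following sketch lays out a question/answer pair with the [CLS] and [SEP] markers and builds the matching token_type_ids. The helper name `build_pair_inputs` and the example tokens are made up for this example.

```python
def build_pair_inputs(tokens_a, tokens_b=None):
    """Lay out [CLS] A [SEP] (B [SEP]) and the matching token_type_ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)          # first sequence: type 0
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second sequence: type 1
    return tokens, token_type_ids


question = ["who", "wrote", "it", "?"]
answer = ["jane", "did"]
tokens, token_type_ids = build_pair_inputs(question, answer)
for tok, seg in zip(tokens, token_type_ids):
    print(seg, tok)
```

With a single input sequence, the same helper returns all zeros.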

All 4 comments

Do you mean the true ‘word’ embedding?
BERT uses a WordPiece tokenizer (a BPE-like subword scheme) to split an OOV word into a series of tokens, so you need to prepare:

  1. input sentence : length T
  2. WordPiece-tokenized sentence : length T’
  3. convert it to ids : length T’

    • this is token_ids

    • segment_ids are all 0 with length T’ (single-sentence input)

    • the use_one_hot_embeddings flag only selects the lookup method for TPU versus CPU/GPU.

If you have input of shape
(batch size, max sentence size)
then, after converting it, you will get:
(batch size, bert max sentence size)
Feed that to the BERT model
and you will get:
(batch size, bert max sentence size, bert model size)
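The conversion from words (length T) to padded sub-token ids (length "bert max sentence size") can be sketched in plain Python. This is a toy greedy longest-match splitter in the WordPiece style with a hypothetical ten-entry vocabulary; a real run would use the tokenizer and vocab.txt shipped with the pretrained model.

```python
# Hypothetical toy vocabulary; index 0 doubles as the padding id.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "play", "##ing",
         "un", "##afford", "##able", "is"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def wordpiece(word):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in TOKEN_TO_ID:
                piece = cand
                break
            end -= 1
        if piece is None:          # no sub-token matches: whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(words, max_seq_length):
    """Return (input_ids, input_mask, segment_ids), each of max_seq_length."""
    tokens = ["[CLS]"]
    for w in words:
        tokens.extend(wordpiece(w))
    tokens = tokens[:max_seq_length - 1] + ["[SEP]"]
    input_ids = [TOKEN_TO_ID[t] for t in tokens]
    input_mask = [1] * len(input_ids)
    pad = max_seq_length - len(input_ids)
    input_ids += [0] * pad
    input_mask += [0] * pad
    segment_ids = [0] * max_seq_length   # single-sentence input
    return input_ids, input_mask, segment_ids


batch = [["playing", "is", "unaffordable"], ["is", "playing"]]
encoded = [encode(words, max_seq_length=10) for words in batch]
print(len(encoded), len(encoded[0][0]))   # batch size, padded length
```

Here the first sentence has T = 3 words but T’ = 6 sub-tokens, which is why the padded length has to be chosen in sub-tokens rather than words.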

As you can see, the true word embedding is spread across several BERT token embeddings.

In an NER task, we can treat the first BERT token embedding of each word as the word embedding.

or

we could merge the BERT token embeddings of a word into one word embedding,
but that is hard to compute with TensorFlow ops.
(Maybe it is possible in PyTorch?)

or

we could train a BERT model from scratch using an untokenized (word-level) version of the text.
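The first two options above (take the first sub-token, or merge all sub-tokens of a word) can be sketched outside any framework. This is only an illustration on plain lists: `word_vectors` and the `word_ids` alignment list are hypothetical names, and the [CLS]/[SEP] vectors are assumed to have been stripped already.

```python
def word_vectors(token_vectors, word_ids, mode="first"):
    """Collapse sub-token vectors into one vector per word.

    token_vectors: one vector (list of floats) per sub-token.
    word_ids: for each sub-token, the index of the word it came from.
    mode: "first" keeps the first sub-token, "mean" averages all of them.
    """
    groups = {}
    for vec, w in zip(token_vectors, word_ids):
        groups.setdefault(w, []).append(vec)
    out = []
    for w in sorted(groups):
        vecs = groups[w]
        if mode == "first":
            out.append(list(vecs[0]))
        else:
            out.append([sum(col) / len(vecs) for col in zip(*vecs)])
    return out


# Sub-tokens "play", "##ing", "is" -> words 0, 0, 1 (vector values are made up).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
word_ids = [0, 0, 1]

first = word_vectors(token_vectors, word_ids, mode="first")
mean = word_vectors(token_vectors, word_ids, mode="mean")
print(first)
print(mean)
```

Inside a graph, the same grouping would need a gather or segment-style reduction driven by `word_ids`, which is the part that is awkward to express.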

Finally, if we want to use some kind of word-based feature (say, a word embedding from a character convolution, or a part-of-speech embedding),

we need to extend those feature vectors to align with the BERT token vectors, like this:

https://github.com/dsindex/etagger/blob/master/input.py#L105
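One simple way to do that alignment is to repeat each word-level feature once per sub-token. The sketch below assumes this repeat strategy and is not necessarily what the linked file does; `extend_to_tokens` and the example tags are hypothetical.

```python
def extend_to_tokens(word_features, pieces_per_word):
    """Repeat each word-level feature once per sub-token of that word."""
    extended = []
    for feat, n_pieces in zip(word_features, pieces_per_word):
        extended.extend([feat] * n_pieces)
    return extended


words = ["playing", "is", "unaffordable"]
pos_tags = ["VBG", "VBZ", "JJ"]      # one feature per word (length T)
pieces_per_word = [2, 1, 3]          # sub-tokens per word (sums to T')

token_pos_tags = extend_to_tokens(pos_tags, pieces_per_word)
print(token_pos_tags)
```

The extended list has length T’, so it can be concatenated position by position with the BERT token vectors.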

@dsindex hi there!
Nice repo you have.
I saw you are working on NER. Awesome work. I worked on it once too.
There is one thing I still don't get.
What is the segment id, and what should its shape be?
Apart from that I got my answer.

@sbmaruf

segment_id is always 0 with length T’ for NER.
If you try to solve a problem with two inputs A and B
(for example, MRC or sentence pair classification),
then sentence A should have segment_id 0 and sentence B should have segment_id 1 for the segment embedding.

The pretrained BERT includes a pretrained segment embedding, and the vocabulary size for the segment embedding is 2.
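A rough sketch of what a two-row segment embedding means in practice: each token's vector gets the row selected by its segment id added to it. The table values and the helper `add_segment_embedding` are made up for illustration, and the position embeddings and layer normalization that BERT also applies are left out.

```python
# Hypothetical segment embedding table: vocabulary size 2, width 2.
SEGMENT_TABLE = [[0.5, 0.5],
                 [10.0, 10.0]]


def add_segment_embedding(token_embeddings, segment_ids):
    """Add the segment row selected by each id to the token's embedding."""
    out = []
    for vec, seg in zip(token_embeddings, segment_ids):
        seg_vec = SEGMENT_TABLE[seg]   # only ids 0 and 1 exist in the table
        out.append([a + b for a, b in zip(vec, seg_vec)])
    return out


token_embeddings = [[1.0, 2.0], [3.0, 4.0]]
combined = add_segment_embedding(token_embeddings, segment_ids=[0, 1])
print(combined)
```

Because the table has only two rows, a segment id of 2 or more has no pretrained vector to look up.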
