I thought the feature vectors extracted from BERT represent word embeddings.
So I thought that, in order to use these embeddings, one just has to extract them (using extract_features.py), then load the weights into an Embedding layer (yes, I'm a Keras person), and then build whatever we want on top of this Embedding layer.
But that is wrong, isn't it? Using extract_features.py, I got the weights of the last 4 layers, for each word in each sentence fed as input!
So instead of the 4 * X weights I expected (X being the size of a layer), I have 4 * X * tokens_used_in_input_file weights!
How do I use the feature vectors to build a task-specific model architecture on top of BERT?
The embedding table contains context-free WordPiece embeddings. These are not particularly useful. They will just be worse versions of what you would get from GloVe/word2vec/FastText etc.
extract_features.py gives you contextual representations, which are "embeddings" of each token in the context of the sentence. This is what you would want to build a model on. For this, you need to run your full training and test data through extract_features.py and use the resulting vectors as the input to your model, just like you would use an embedding (to handle the 4x, you can just concatenate the 4 vectors for each token).
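The concatenation step can be sketched roughly as follows. This is a minimal sketch, not part of the BERT repository: it assumes the output file is JSON lines, where each line has a "features" list whose entries carry a "token" string and a "layers" list of {"index", "values"} objects. Check that against your own output file. The helper names concat_layers and load_features are made up for illustration.

```python
import json


def concat_layers(json_line):
    """Turn one output line (one input sentence) into (tokens, vectors).

    Assumed layout, to be verified against your own file:
    {"features": [{"token": str,
                   "layers": [{"index": int, "values": [float, ...]}, ...]}]}
    """
    record = json.loads(json_line)
    tokens, vectors = [], []
    for feature in record["features"]:
        tokens.append(feature["token"])
        # Concatenate the per-layer vectors of this token into one vector
        # (4 layers of size X give one vector of size 4 * X).
        vector = []
        for layer in feature["layers"]:
            vector.extend(layer["values"])
        vectors.append(vector)
    return tokens, vectors


def load_features(path):
    """Read a whole output file: one (tokens, vectors) pair per sentence."""
    with open(path, encoding="utf-8") as handle:
        return [concat_layers(line) for line in handle if line.strip()]
```

Each sentence then yields one vector of size 4 * X per token, so the total count grows with the number of tokens in the input file, which is the 4 * X * tokens_used_in_input_file figure from the question.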
Oh I see.
I thought extract_features.py was a script that produces the embeddings, which we could then use wherever we want.
But from what you said, extract_features.py _IS_ the Embedding layer.
It makes sense: having an embedding for each word independently would mean no context.
Thank you very much for your kind and clear explanations.
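Since the precomputed vectors take the place of an Embedding layer's output, a downstream model needs them as a fixed-shape batch. Here is a rough sketch of that padding step in plain Python, under a few assumptions: each sentence is already a list of per-token vectors (for example the concatenated 4 * X vectors), the first sentence is non-empty, and sentences longer than max_len are simply truncated. The name pad_batch is hypothetical.

```python
def pad_batch(sentences, max_len):
    """Pad or truncate per-sentence token vectors to a common length.

    sentences: list of sentences, each a list of equal-sized token vectors.
    Returns (padded, mask), where padded has shape
    [num_sentences][max_len][dim] and mask marks real tokens with 1
    and padding positions with 0.
    """
    dim = len(sentences[0][0])  # assumes the first sentence has a token
    padded, mask = [], []
    for vectors in sentences:
        kept = vectors[:max_len]
        n_pad = max_len - len(kept)
        # Real token vectors first, then zero vectors as padding.
        padded.append([list(v) for v in kept] + [[0.0] * dim for _ in range(n_pad)])
        mask.append([1] * len(kept) + [0] * n_pad)
    return padded, mask
```

In Keras, the task-specific model would then start from an input of shape (max_len, 4 * X) instead of an Embedding layer, with the mask used to ignore the padded positions.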