Bert: How to extract the word embedding parameters from the pretrained files?

Created on 1 Feb 2019 · 3 Comments · Source: google-research/bert

Hi there, can anyone give some tips on extracting the word embedding parameters from the pretrained files?

All 3 comments

You could use extract_features.py as a guide. Look for the topic 'BERT for feature extraction' in this post
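
As a rough sketch of how that script's output might be consumed: extract_features.py writes one JSON object per input line, and the reader below pulls out the vector of one chosen layer for each token. The field names (`features`, `token`, `layers`, `index`, `values`) are assumed from the script's output format and should be checked against a real output file; `read_layer` and the sample record are made up for illustration.

```python
import json


def read_layer(jsonl_lines, layer_index=-1):
    """Yield (token, vector) pairs for one layer.

    Assumes each line is a JSON object shaped like
    {"features": [{"token": ..., "layers": [{"index": ..., "values": [...]}]}]}.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for feature in record["features"]:
            for layer in feature["layers"]:
                if layer["index"] == layer_index:
                    yield feature["token"], layer["values"]


# Made-up record with 2-dimensional vectors (real vectors are much longer).
sample = json.dumps({
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2]}]},
        {"token": "king", "layers": [{"index": -1, "values": [0.3, 0.4]}]},
    ]
})
pairs = list(read_layer([sample], layer_index=-1))
```

With a real output file, `jsonl_lines` can simply be the open file object, since iterating over it yields one line at a time.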

Maybe I misunderstood something, but shouldn't the very first layer (i.e. -12 for the base model) give a straight word embedding vector? Given the size of the underlying corpus, I would then expect to be able to do the typical "arithmetic" like king - man + woman = queen when using the encodings of the individual words. However, that does not seem to work.
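
If what is wanted is the static, context-free lookup table rather than a layer output, one option is to read it straight from the checkpoint. As far as I can tell from the code, the layers exposed by extract_features.py are outputs of Transformer blocks, so layer -12 is already contextual and is not the raw lookup table. The sketch below assumes the table is stored under the variable name `bert/embeddings/word_embeddings` and that row i corresponds to line i of `vocab.txt`; the helper names and the checkpoint path in the comment are placeholders, not part of the BERT code base.

```python
def load_vocab(vocab_path):
    """Map each WordPiece token to its row index (line i of the file -> row i)."""
    with open(vocab_path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}


def word_embedding_table(reader, name="bert/embeddings/word_embeddings"):
    """Read the static embedding matrix from a checkpoint reader.

    `reader` is anything with a get_tensor(name) method, such as the
    object returned by tf.train.load_checkpoint.
    """
    return reader.get_tensor(name)


def embedding_for(token, vocab, table):
    """Return the row of the table for one token that exists in the vocabulary."""
    return table[vocab[token]]


def open_checkpoint(ckpt_prefix):
    """Open a TensorFlow checkpoint; imported lazily so the helpers above work without TF."""
    import tensorflow as tf
    return tf.train.load_checkpoint(ckpt_prefix)


# Hypothetical usage (paths are placeholders):
#   reader = open_checkpoint("path/to/bert_model.ckpt")
#   vocab = load_vocab("path/to/vocab.txt")
#   table = word_embedding_table(reader)   # shape: (vocab_size, hidden_size)
#   vector = embedding_for("king", vocab, table)
```

Note that the vocabulary consists of WordPiece units, so a word that is not in `vocab.txt` as a whole is split into several pieces and has no single row of its own.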

Yes,
you could extract the first layer's embeddings and use them as they are for any further task.
But don't be surprised if these first-layer embeddings don't work well on your task, since this is not the only way to extract embeddings from BERT. The base model is a stack of 12 Transformer encoder layers, and each layer produces its own embedding for every token. You could use any combination of these layer outputs (e.g. the last 4 layers or the first 3 layers) and either average or concatenate them, then use the result as the final BERT embedding.
Which combination works best depends entirely on the task you're applying BERT to, so you may have to try a few.
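
The two ways of combining layers can be sketched in plain Python as follows. This is only an illustration for a single token: the function names and the toy 2-dimensional vectors are made up, and in practice an array library would normally be used on the full per-layer outputs.

```python
def average_layers(layers):
    """Element-wise mean of several equal-length layer vectors for one token."""
    if not layers:
        raise ValueError("need at least one layer")
    return [sum(values) / len(layers) for values in zip(*layers)]


def concat_layers(layers):
    """Concatenate layer vectors end to end, so the size grows with each layer."""
    return [value for layer in layers for value in layer]


# Toy per-layer vectors for a single token: 12 layers, 2 dimensions each.
all_layers = [[float(i), float(i) + 0.5] for i in range(12)]
last_four = all_layers[-4:]

averaged = average_layers(last_four)      # same length as one layer vector
concatenated = concat_layers(last_four)   # four times the length of one layer vector
```

Averaging keeps the vector size fixed, while concatenation keeps each layer's information separate at the cost of a larger input to whatever model consumes it.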

For more information on this, please refer to the last section of the original BERT paper
https://arxiv.org/pdf/1810.04805.pdf
You'll find it under section 5.4 Feature-based Approach with BERT.

Hope this helps!
