Bert: How to use a pre-trained model with additional custom features

Created on 29 Nov 2018 · 7 comments · Source: google-research/bert

Hi,

I'd like to train a layer on top of BERT's pre-trained model, which uses not only the values from BERT's last layer, but also a set of values of features that I calculate externally. I'm trying to see if this is possible with the current Estimator framework that run_classifier is using.

In more detail:
I have a single-sentence binary-classification task.
I can use run_classifier to train on my task, in the same way it handles the CoLA dataset. It will learn a layer on top of BERT's pre-trained model, and it will be able to make predictions on unseen sentences.
However, I would like to experiment with a setup in which I add some custom features based on which the prediction should be made. So for each sentence, there will be a text and a set of features that I calculate by myself. Examples of features: the number of words in that sentence, the height of the parse tree of the sentence. I have about 20 features like these, and I can calculate them for each sentence.

So my train.tsv, eval.tsv and test.tsv files look like this:

sentence text | feature 1 | feature 2 | ... | feature 20 | label (true/false)
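
A file laid out this way could be read with Python's standard csv module. The following is only a sketch under assumptions that the post does not state: the columns are tab-separated, there is no header row, and the label is the literal string "true" or "false". The helper name read_examples and the sample row are hypothetical.

```python
import csv
import io

NUM_FEATURES = 20  # the post mentions about 20 hand-computed features per sentence


def read_examples(tsv_file):
    """Yield (text, features, label) from rows of: text, 20 feature columns, label."""
    # QUOTE_NONE keeps quote characters inside sentences as ordinary text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != NUM_FEATURES + 2:
            raise ValueError(f"expected {NUM_FEATURES + 2} columns, got {len(row)}")
        text = row[0]
        features = [float(value) for value in row[1:-1]]
        label = row[-1].strip().lower() == "true"  # assumed label encoding
        yield text, features, label


# A single made-up row standing in for train.tsv.
sample = "\t".join(["The cat sat."] + ["1.0"] * NUM_FEATURES + ["true"]) + "\n"
examples = list(read_examples(io.StringIO(sample)))
```

For a real file, io.StringIO(sample) would be replaced by something like open("train.tsv", newline="", encoding="utf-8").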

Is it possible to train and evaluate with such a setup?

Is it possible to do so without significant changes to the script run_classifier.py?

Thanks!

assaftibm

Most helpful comment

Is it possible to do so without significant changes to the script run_classifier.py?

No.

Is it possible to train and evaluate with such a setup?

Yes, but not with run_classifier.py. You need to extract the feature vectors from BERT (as described in the README's section on extracting fixed feature vectors).

So what you need to do is:

Step 1: Transform your sentence text into feature vectors using BERT (extract_features.py).
Step 2: Build your custom layer, which takes as input both the feature vectors from Step 1 and all your own features. Train this layer using your labels.
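
Step 2 could look roughly like the sketch below, which concatenates each BERT vector with the hand-computed features and trains a single sigmoid unit on the result. This is not code from the BERT repository. The function names, the toy 4-dimensional vectors (real BERT-Base vectors have 768 dimensions), the plain SGD training loop, and the hyperparameters are all assumptions for illustration.

```python
import math
import random


def concat_inputs(bert_vec, custom_feats):
    """Input to the custom layer: BERT vector followed by the hand-computed features."""
    return list(bert_vec) + list(custom_feats)


def sigmoid(z):
    # Branch on the sign of z so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(xs, ys, epochs=200, lr=0.1, seed=0):
    """Train one sigmoid unit with plain SGD on the log loss."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict_proba(w, b, xs[i]) - ys[i]  # d(log loss) / d(logit)
            w = [wi - lr * err * xi for wi, xi in zip(w, xs[i])]
            b -= lr * err
    return w, b


# Toy data (made up): 4-dim stand-ins for BERT vectors, 2 custom features each.
bert_vecs = [[0.2, -0.1, 0.4, 0.0], [0.1, 0.3, -0.2, 0.5],
             [-0.3, 0.2, 0.1, -0.4], [0.0, -0.2, 0.3, 0.1]]
custom = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
labels = [1, 1, 0, 0]

xs = [concat_inputs(v, f) for v, f in zip(bert_vecs, custom)]
w, b = train_logistic(xs, labels)
preds = [int(predict_proba(w, b, x) > 0.5) for x in xs]
```

With real data, features such as word counts and parse-tree heights would probably need to be scaled to a range comparable to the BERT values before concatenation, and any standard classifier library could replace the hand-written training loop.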

astariul-colanim on 30 Nov 2018

👍9

All 7 comments

That's very helpful. Thanks. May I ask two more follow-up questions?
From the paper, I understand that for sequence classification tasks BERT uses the final-layer (layer -1) hidden state of the first token, [CLS]. So I suppose that this vector is used as a sentence embedding.
My questions are:

Is it reasonable to rely on this word embedding for calculating sentence similarity by a measure like cosine-similarity?
Did you encounter cases where it was useful for the down-stream NLP task to use the embedding of CLS from other layers (i.e. not only the final one)?
I ask because the ELMo paper (Peters et al. 2018) reported that, in their BiLSTM, the top layer encoded features useful for word sense disambiguation, while the lower layer encoded syntactic features that are more useful for tasks like POS tagging. Learning a linear combination of the states in each layer is therefore part of the optimization for each downstream task.
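
The ELMo-style linear combination mentioned here can be sketched as a softmax-weighted sum of the per-layer vectors, scaled by a single factor. In a real model the per-layer scores and the scale gamma would be learned together with the downstream task. In this sketch they are fixed by hand, and the vectors and the function name scalar_mix are made up for illustration.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_vectors, scores, gamma=1.0):
    """Weighted sum of per-layer vectors with softmax-normalized weights, scaled by gamma."""
    weights = softmax(scores)
    dim = len(layer_vectors[0])
    return [gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]


# Made-up 3-dim vectors for three layers; equal scores give an unweighted average.
layers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixed = scalar_mix(layers, scores=[0.0, 0.0, 0.0])
```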

assaftibm on 1 Dec 2018

Is it reasonable to rely on this word embedding for calculating sentence similarity by a measure like cosine-similarity?

See #196.
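
For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The vectors are made up, and the example shows only the measure itself; it says nothing about how well [CLS] embeddings work for sentence similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```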

Did you encounter cases where it was useful for the down-stream NLP task to use the embedding of CLS from other layers (i.e. not only the final one)?

In the paper's feature-based experiments, they got the best results by concatenating the last 4 layers.

In the repo linked in #196, they say that the second-to-last layer is a better representation of the sentence.
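
Both options can be tried on the output of extract_features.py. The sketch below assumes that each output line is a JSON object with a "features" list (one entry per token, the first being [CLS]) and that each token has a "layers" list of {"index", "values"} entries. That layout and the sample line are assumptions to check against your own output file, and the helper names are hypothetical.

```python
import json


def token_layers(json_line, token_index=0):
    """Map layer index -> vector for one token of one output line (0 is assumed to be [CLS])."""
    record = json.loads(json_line)
    token = record["features"][token_index]
    return {layer["index"]: layer["values"] for layer in token["layers"]}


def concat_layers(layers_by_index, indexes=(-1, -2, -3, -4)):
    """Concatenate the vectors of the requested layers, in the given order."""
    out = []
    for i in indexes:
        out.extend(layers_by_index[i])
    return out


# Made-up line with 2-dim vectors; real BERT-Base vectors have 768 values per layer.
line = json.dumps({
    "linex_index": 0,
    "features": [{
        "token": "[CLS]",
        "layers": [
            {"index": -1, "values": [0.1, 0.2]},
            {"index": -2, "values": [0.3, 0.4]},
            {"index": -3, "values": [0.5, 0.6]},
            {"index": -4, "values": [0.7, 0.8]},
        ],
    }],
})

by_layer = token_layers(line)
last_four = concat_layers(by_layer)  # 4 layers x 2 dims = 8 values
second_to_last = by_layer[-2]
```

Which tokens to use (only [CLS], or a pooled value over all tokens) is a separate choice that this sketch does not address.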

astariul-colanim on 3 Dec 2018

Here is a visualization that may help you understand the different BERT layers: https://github.com/hanxiao/bert-as-service#q-so-which-layer-and-which-pooling-strategy-is-the-best

hanxiao on 7 Dec 2018

👎5 👍2

I've been playing around with adding extra features, and have written one approach here:
bert_tpu_tweet_model.ipynb. This approach builds extra features directly into the model with fine-tuning, without needing to extract pre-trained features to input (and concatenate new or additional features) into another layer.

This may or may not be helpful, depending on your use case.

mikehikes on 10 Mar 2019

👍2 🎉1

I would highly recommend checking out this repository: https://github.com/BrikerMan/Kashgari
They have implemented this feature, and it may be useful to you.

ishita-gupta98 on 17 Jul 2019

I've been playing around with adding extra features, and have written one approach here:
bert_tpu_tweet_model.ipynb. This approach builds extra features directly into the model with fine-tuning, without needing to extract pre-trained features to input (and concatenate new or additional features) into another layer.

This may or may not be helpful, depending on your use case.

Hi,
Did adding the new features to the last layer have any effect on the results?