Hi,
I'd like to train a layer on top of BERT's pre-trained model, which uses not only the values from BERT's last layer, but also a set of values of features that I calculate externally. I'm trying to see if this is possible with the current Estimator framework that run_classifier is using.
In more detail:
I have a single-sentence binary-classification task.
I can use run_classifier to train on my task, in the same way it handles the CoLA dataset. It will learn a layer on top of BERT's pre-trained model, and it will be able to make predictions on unseen sentences.
However, I would like to experiment with a setup in which I add some custom features based on which the prediction should be made. So for each sentence, there will be a text and a set of features that I calculate by myself. Examples of features: the number of words in that sentence, the height of the parse tree of the sentence. I have about 20 features like these, and I can calculate them for each sentence.
So my train.tsv, eval.tsv and test.tsv files look like this:
Is it possible to train and evaluate with such a setup?
Is it possible to do so without significant changes to the script run_classifier.py?
Thanks!
Is it possible to do so without significant changes to the script run_classifier.py?
No.
Is it possible to train and evaluate with such a setup?
Yes. But you can't use run_classifier.py. You need to extract the feature vectors from BERT (as described in this part of the README).
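As a rough illustration of that idea, here is a minimal sketch that assumes the JSON-lines output written by extract_features.py (one object per input sentence, with a `features` list whose entries hold a `token` and per-layer `values`). The helper names `cls_vector` and `combine`, the tiny sample record, and the two external feature values are made up for illustration; the combined vector would then go into whatever classifier you train separately.

```python
import json

def cls_vector(record, layer_index=-1):
    """Return the [CLS] token's vector for one layer of an extract_features.py record."""
    cls_feature = record["features"][0]  # the first token is [CLS]
    for layer in cls_feature["layers"]:
        if layer["index"] == layer_index:
            return layer["values"]
    raise KeyError(f"layer {layer_index} not found in record")

def combine(record, external_features, layer_index=-1):
    """Concatenate the BERT sentence vector with externally computed features."""
    return list(cls_vector(record, layer_index)) + [float(x) for x in external_features]

# Tiny made-up record; real vectors are much longer (e.g. 768 values per layer).
line = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, -0.2, 0.3]}]},
        {"token": "hello", "layers": [{"index": -1, "values": [0.5, 0.5, 0.5]}]},
    ],
})
record = json.loads(line)

# Hypothetical external features: word count and parse-tree height.
combined = combine(record, [1, 2])
print(combined)
```

In practice the external features usually need scaling to a range comparable to the BERT values before concatenation, otherwise large raw counts can dominate the classifier.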
So what you need to do is:
extract_features.py)
That's very helpful. Thanks. May I ask two more follow-up questions?
From the paper I understand that for sequence classification tasks, BERT uses the embedding of the first token, [CLS], from the last layer (layer -1). So I suppose that this embedding is used like a sentence embedding.
My questions are:
Is it reasonable to rely on this word embedding for calculating sentence similarity by a measure like cosine-similarity?
Did you encounter cases where it was useful for the downstream NLP task to use the embedding of [CLS] from other layers (i.e. not only the final one)?
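For reference, the cosine-similarity measure mentioned in the first question can be sketched as below. The three short vectors are made-up stand-ins for [CLS] vectors, and the function name is illustrative. The sketch only shows how the measure is computed; whether it is meaningful on raw [CLS] vectors is exactly what the question asks.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional stand-ins for sentence vectors.
sentence_a = [1.0, 2.0, 3.0]
sentence_b = [2.0, 4.0, 6.0]   # same direction as sentence_a
sentence_c = [-3.0, 0.0, 1.0]  # orthogonal to sentence_a

sim_ab = cosine_similarity(sentence_a, sentence_b)
sim_ac = cosine_similarity(sentence_a, sentence_c)
print(sim_ab, sim_ac)
```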
In the paper they say they got better results by concatenating the last 4 layers.
In the repo linked in #196 (here), they say that the second-to-last layer is a better representation of the sentence
Here is a visualization that may help you understand the different BERT layers: https://github.com/hanxiao/bert-as-service#q-so-which-layer-and-which-pooling-strategy-is-the-best
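To make the layer and pooling choices discussed above concrete, here is a small sketch. It assumes per-token records shaped like the extract_features.py output (a `layers` list of `index`/`values` pairs); the function names and the 2-dimensional sample values are made up for illustration, and the sketch does not say which choice works best.

```python
def select_layers(token_feature, layer_indexes):
    """Concatenate one token's vectors from the requested layers, in the given order."""
    by_index = {layer["index"]: layer["values"] for layer in token_feature["layers"]}
    vector = []
    for index in layer_indexes:
        vector.extend(by_index[index])  # KeyError if that layer was not extracted
    return vector

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

# Made-up [CLS] record with the last four layers extracted.
cls_feature = {
    "token": "[CLS]",
    "layers": [
        {"index": -1, "values": [1.0, 2.0]},
        {"index": -2, "values": [3.0, 4.0]},
        {"index": -3, "values": [5.0, 6.0]},
        {"index": -4, "values": [7.0, 8.0]},
    ],
}

last_four = select_layers(cls_feature, [-1, -2, -3, -4])  # concatenation of the last 4 layers
second_to_last = select_layers(cls_feature, [-2])         # second-to-last layer only
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])              # average over two token vectors
print(last_four, second_to_last, pooled)
```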
I've been playing around with adding extra features, and have written one approach here:
bert_tpu_tweet_model.ipynb. This approach builds extra features directly into the model with fine-tuning, without needing to extract pre-trained features to input (and concatenate new or additional features) into another layer.
This may or may not be helpful, depending on your use case.
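The general idea of feeding extra features into the classification layer can be sketched as follows. This is not the notebook's actual code: it is only the forward pass of a single sigmoid output unit over the concatenation of BERT's pooled output and the extra features, with made-up names, weights, and values. During fine-tuning, these weights and BERT's own parameters would be learned jointly.

```python
import math

def predict_proba(pooled_output, extra_features, weights, bias):
    """Sigmoid of a linear layer applied to [pooled_output ; extra_features]."""
    inputs = list(pooled_output) + list(extra_features)
    if len(inputs) != len(weights):
        raise ValueError("weights must match the concatenated input length")
    logit = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up values: a 2-dimensional "pooled output" and one extra feature.
pooled_output = [0.5, -0.5]
extra_features = [2.0]

# Weights that ignore the extra feature: the logit is 0.5 - 0.5 = 0.
p_without_extra = predict_proba(pooled_output, extra_features, [1.0, 1.0, 0.0], 0.0)
# Weights that use only the extra feature: the logit is 2.0.
p_with_extra = predict_proba(pooled_output, extra_features, [0.0, 0.0, 1.0], 0.0)
print(p_without_extra, p_with_extra)
```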
I would highly recommend checking out this repository https://github.com/BrikerMan/Kashgari
They have implemented this feature and it may be useful to you
Hi,
did adding new features to the last layer have any effect on the results?