Bert: Obtaining low scores in the STS 2012 task

Created on 15 Nov 2018 · 6 Comments · Source: google-research/bert

Hi!

I've been running some experiments with sentence embeddings, using SentEval to obtain results on several tasks, in particular the STS 2012 task. I'm opening this issue because using _bert_ is yielding low scores:

ALL (weighted average): Pearson: 0.2364, Spearman: 0.3241
ALL (average): Pearson: 0.2863, Spearman: 0.3503

(For comparison, both _InferSent_ and _Google Universal Sentence Encoder_ score between 0.60 and 0.65 on all of them.)
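
As a rough sketch of what these numbers measure, assuming SentEval's default STS setup (cosine similarity between the two sentence embeddings of each pair, correlated against the gold similarity scores), the Pearson figure could be reproduced along these lines. The helper names `cosine`, `pearson`, and `sts_pearson` are illustrative and are not SentEval functions; zero-norm vectors and constant score lists are not handled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sts_pearson(embeddings_a, embeddings_b, gold_scores):
    """Score each sentence pair by cosine similarity, then correlate with gold."""
    predicted = [cosine(u, v) for u, v in zip(embeddings_a, embeddings_b)]
    return pearson(predicted, gold_scores)


# Toy data (made up for illustration): three sentence pairs with 2-d embeddings.
emb_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
emb_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
gold = [5.0, 2.5, 0.0]
print("Pearson: {:.4f}".format(sts_pearson(emb_a, emb_b, gold)))
```

Under this setup, a low score means that the cosine similarities of the embeddings do not track the human similarity judgements, independently of how well the embeddings might work after fine-tuning.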

My approach:

I'm using extract_features.py to obtain the values of the top layer (the one denoted -1). I then use the vector obtained for the [CLS] token as the sentence embedding, following the paper: _"In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding."_ I'm using the BERT-Large Uncased model, so I lowercase all the sentences in the batcher function of SentEval. The corresponding code is:

import json
import subprocess
import time

import numpy as np


def batcher(params, batch):

    # Translating empty lines into something else ([.])
    batch = [sent if sent != [] else ['.'] for sent in batch]

    # Join each tokenized sentence into one lowercased string (the input for BERT)
    batch_sents = [" ".join([w.lower() for w in sent]) for sent in batch]

    with open("temp_bert_in.txt", 'w') as rb: 
        for line in batch_sents:
            rb.write(line + "\n")

    init_time = time.time()
    subprocess.call(['bash', 'run_bert.sh'])
    print("Creating sent embeds took {:.2f} s".format(time.time()-init_time))

    # Parse the output file (one JSON object per sentence) with the pre-trained BERT features
    with open("temp_bert_out.json", "r") as f:
        json_list = [json.loads(line) for line in f]

    # Create the embeddings
    embed_dim = len(json_list[0]["features"][0]["layers"][0]["values"])
    embeddings = np.zeros((len(batch), embed_dim))

    for ix, sentence_json in enumerate(json_list):
        # features[0] is the first token ([CLS]); layers[0] is the first requested layer
        cls_emb = sentence_json["features"][0]["layers"][0]["values"]
        embeddings[ix] = cls_emb

    return embeddings.astype('float32')

The goal of this function is simply to return a matrix with one sentence embedding per sentence in the batch. The run_bert.sh script is just a convenient way of calling the extract_features.py script.
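
The contents of run_bert.sh are not shown in this thread. As a minimal sketch of what such a wrapper might do, the command line can be assembled from the flags documented in the BERT repository's README for extract_features.py. The helper name `build_extract_features_cmd`, the model directory, and the `max_seq_length` and `batch_size` values are assumptions for illustration, not the settings actually used here.

```python
import os


def build_extract_features_cmd(model_dir, input_file, output_file,
                               layers="-1", max_seq_length=128, batch_size=8):
    """Return the argument list for one extract_features.py run."""
    return [
        "python", "extract_features.py",
        "--input_file=" + input_file,
        "--output_file=" + output_file,
        "--vocab_file=" + os.path.join(model_dir, "vocab.txt"),
        "--bert_config_file=" + os.path.join(model_dir, "bert_config.json"),
        "--init_checkpoint=" + os.path.join(model_dir, "bert_model.ckpt"),
        "--layers=" + layers,  # "-1" requests only the top layer
        "--max_seq_length=" + str(max_seq_length),
        "--batch_size=" + str(batch_size),
    ]


cmd = build_extract_features_cmd(
    "uncased_L-24_H-1024_A-16",  # assumed local path to the BERT-Large Uncased checkpoint
    "temp_bert_in.txt",
    "temp_bert_out.json",
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.call` would replace the call to the shell script in the batcher, provided extract_features.py is in the working directory.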

I would like to know if I'm making some logical mistake and not using _bert_ as intended, or if anyone can give me an intuition on why the scores might be so low. Thanks in advance.

PedroMLF

Most helpful comment

Note: without fine-tuning, [CLS] is not a good representation of the sentence. Please check out my repo, https://github.com/hanxiao/bert-as-service, which offers a fast and scalable way to extract sentence features.

hanxiao on 19 Nov 2018

👍3

All 6 comments

Ok, I'll look into it. Thanks!

Edit: Although your suggested representation helped, the pre-trained BERT model used as-is still fell short of the other approaches by a significant margin.

PedroMLF on 19 Nov 2018

@PedroMLF can you share the scores you got for STS 2012? Did BERT perform better than InferSent or Google USE for any particular choice?

mvss80 on 6 Jan 2019

These are the scores I obtained using SentEval:

[Image: screenshot from 2019-01-07 10-43-49]

In my experiments, BERT did not outperform any of the approaches shown above across STS12/13/14/15/16. The results are shown below.

[Image: screenshot from 2019-01-07 10-48-51]

PedroMLF on 7 Jan 2019

👍2

Thanks @PedroMLF - I also checked again, using mean pooling over layer -2 and the mean of layers -2, -3, -4, and -5, and got results similar to yours.
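
For reference, a minimal sketch of this kind of pooling over one line of the extract_features.py output, assuming the requested layers were passed via `--layers` and that each token entry carries a `"token"` string and a `"layers"` list of `{"index", "values"}` objects. The function name `mean_pool` and the `skip_special` option are illustrative; whether [CLS] and [SEP] were included in the averages reported here is not stated.

```python
def mean_pool(sentence_json, layer_indexes=(-2,), skip_special=False):
    """Average token vectors over the chosen layers, then over the tokens."""
    token_vectors = []
    for feature in sentence_json["features"]:
        if skip_special and feature["token"] in ("[CLS]", "[SEP]"):
            continue
        by_index = {layer["index"]: layer["values"] for layer in feature["layers"]}
        chosen = [by_index[i] for i in layer_indexes]
        # Element-wise mean across the selected layers for this token
        token_vectors.append([sum(vals) / len(chosen) for vals in zip(*chosen)])
    if not token_vectors:
        raise ValueError("no tokens left to pool")
    # Element-wise mean across tokens gives the sentence embedding
    return [sum(vals) / len(token_vectors) for vals in zip(*token_vectors)]


# Toy example (made-up 2-d values) in the assumed output format
example = {
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [1.0, 1.0]},
                                      {"index": -2, "values": [0.0, 2.0]}]},
        {"token": "hi", "layers": [{"index": -1, "values": [3.0, 3.0]},
                                   {"index": -2, "values": [2.0, 4.0]}]},
        {"token": "[SEP]", "layers": [{"index": -1, "values": [5.0, 5.0]},
                                      {"index": -2, "values": [4.0, 0.0]}]},
    ]
}
print(mean_pool(example, layer_indexes=(-2,)))
print(mean_pool(example, layer_indexes=(-1, -2)))
```

In the batcher from the original post, the returned list could be assigned to `embeddings[ix]` in place of the [CLS] vector.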

@jacobdevlin-google any ideas why we are seeing such low numbers? Would you have expected the performance to be significantly better?

mvss80 on 9 Jan 2019

@PedroMLF Hi, have you found out why using BERT embeddings gives you such poor performance? I also used BERT sentence embeddings for a binary classification task, and the performance was significantly lower than with other approaches.