Bert: BERT Vector Space shows issues with unknown words

Created on 22 Nov 2018 · 11 Comments · Source: google-research/bert

I'm comparing the embedding vectors of sentences using cosine similarity. A simple version looks like this:

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

This works OK in most cases, but there are some edge cases I don't understand. The model is the cased English model, cased_L-12_H-768_A-12, and I'm using bert-as-service to test this issue.

I compare short sentences to unknown terms, in this case (for testing purposes) random strings of 3 characters:

from random import choice
import string

import numpy as np

from service.client import BertClient


def GenRandomText(length=8, chars=string.ascii_letters + string.digits):
    # Build a random string of the given length from the allowed characters.
    return ''.join(choice(chars) for _ in range(length))


def cosine_sim(a, b):
    # Cosine of the angle between vectors a and b.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


if __name__ == '__main__':
    bc = BertClient(ip='localhost', port=5555)

    for i in range(1, 10):
        # Random 3-letter string standing in for an unknown word.
        leftTerm = GenRandomText(3, string.ascii_letters)
        rightTerm = "how are you today?"

        # encode() takes a list of sentences and returns one vector per sentence.
        leftV = bc.encode([leftTerm])
        rightV = bc.encode([rightTerm])

        cosine_similarity = cosine_sim(leftV[0], rightV[0])

        print("left: %s right: %s similarity: %f" % (leftTerm, rightTerm, cosine_similarity))

This is what happens. The similarity is high for all of these embeddings:

left: Dzq right: how are you today? similarity: 0.803445
left: qqC right: how are you today? similarity: 0.713830
left: HSQ right: how are you today? similarity: 0.745146
left: jMB right: how are you today? similarity: 0.831154
left: naR right: how are you today? similarity: 0.861142
left: Bzi right: how are you today? similarity: 0.833868
left: dCc right: how are you today? similarity: 0.815975
left: qCp right: how are you today? similarity: 0.784781
left: wQM right: how are you today? similarity: 0.836569

If I compute cosine similarity or WMD similarity on the same sentence and term with a different model, I get something quite different.

These are the outputs for left: Dzq right: how are you today?

{
    "wmd_similarities_norm": [
        -0.11699746160433133
    ],
    "cosine_similarities": [
        0.19988850682356737
    ],
    "wmd_similarities": [
        0.44150126919783433
    ]
}

Here wmd_similarities is the Word Mover's Similarity, based on Word Mover's Distance, and cosine_similarities is the cosine similarity.
The WMD was calculated using gensim with the FastText Wikipedia model.

We have also tried different metrics, and the results seem to confirm this issue. For example, given the sentences ["drive a coupe you can stand in (it's lit)"] and ["dfg"]:

Euclidean distance is 16.9716377258
Manhattan distance is 367.4368
Chebyshev similarity is 0.309262271971
Canberra distance is 533.25599833
Cosine similarity is 0.824640512466
WMD similarity (word2vec) is 0.250232081318
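For reference, the vector metrics listed above can be sketched in a few lines. This is a minimal illustration using plain Python lists rather than the tooling used in the thread (which is not stated), and the vectors u and v are made-up values. The sketch computes the Chebyshev distance, whereas the list above reports a "Chebyshev similarity" whose definition is not given. Note that the distances grow with vector magnitude and dimensionality while cosine similarity is bounded in [-1, 1], so the raw numbers are not on comparable scales.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute difference in any single coordinate.
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Coordinates where both values are zero contribute nothing.
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up example vectors (not BERT output).
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 7.0]

for name, fn in [("Euclidean", euclidean), ("Manhattan", manhattan),
                 ("Chebyshev", chebyshev), ("Canberra", canberra),
                 ("Cosine similarity", cosine_sim)]:
    print("%s: %f" % (name, fn(u, v)))
```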

All 11 comments

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).
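One possible reading of the last point, as a toy sketch: a vector can carry exactly the signal a trained classifier needs in one dimension and still look almost identical to a vector of the other class under cosine similarity, because cosine weights every dimension equally. The 4-dimensional vectors and the classify function below are invented for illustration and are not BERT output or part of any library.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: a large component shared by every input
# (first three dimensions) plus a small class signal (last dimension).
shared = [10.0, 10.0, 10.0]
class_a = shared + [1.0]
class_b = shared + [-1.0]

def classify(v):
    # Stand-in for a trained layer that has learned to read only the
    # informative dimension.
    return "A" if v[-1] > 0 else "B"

sim = cosine_sim(class_a, class_b)
print(classify(class_a), classify(class_b))  # the classifier separates them
print("cosine similarity: %f" % sim)         # yet cosine is close to 1
```

In this toy case the classifier is always right while the cosine similarity is 299/301, so a high cosine value says little about whether a downstream model would treat two inputs alike.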

Hey @jacobdevlin-google, I can't tell from your answer whether this is a bug or the expected behavior of BERT when embedding "not meaningful" words, or whether it is due to the average pooling. Maybe @hanxiao has a better idea?

In any case, this behavior isn't what I would expect from a sentence embedding, even one built as the centroid of the word token vectors. In fact, as I have shown above, cosine similarity (or better, Word Mover's similarity) on other word vectors gives reasonable similarity values for these kinds of token sequences.

To make a real-world example, with a behavior like that, it would be impossible to represent ham or spam tokens (let's say for a classifier task), since the latter tokens seem to be equidistant to all the others (!).

Thank you guys in advance.

@loretoparisi There was a bug in my average pooling, max pooling, and concatenated mean-max pooling. It is fixed in the latest master of https://github.com/hanxiao/bert-as-service. Please check it out; it may produce different results.

In principle, this bug has the largest effect when max_seq_len is much longer than the actual sequence length.
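The thread does not spell out what the bug was, so the following is only a sketch of why pooling can go wrong when max_seq_len is much longer than the real sequence: if padding positions are averaged in, they can dominate short inputs. The function names, the padding vector, and the 2-dimensional token vectors are all hypothetical and are not taken from bert-as-service.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool_naive(token_vectors):
    # Averages over every position, padding included.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]

def mean_pool_masked(token_vectors, mask):
    # Averages only over positions whose mask entry is 1 (real tokens).
    kept = [v for v, m in zip(token_vectors, mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

# Hypothetical encoder output with max_seq_len = 6: two real tokens
# followed by four padding positions. The padding vector is a made-up
# stand-in for whatever the model emits at those positions.
pad = [0.5, 0.5]
seq_a = [[1.0, 0.0], [1.0, 0.0]] + [pad] * 4
seq_b = [[0.0, 1.0], [0.0, 1.0]] + [pad] * 4
mask = [1, 1, 0, 0, 0, 0]

naive = cosine_sim(mean_pool_naive(seq_a), mean_pool_naive(seq_b))
masked = cosine_sim(mean_pool_masked(seq_a, mask), mean_pool_masked(seq_b, mask))
print("naive pooling:  %f" % naive)
print("masked pooling: %f" % masked)
```

In this toy setup the two sequences have orthogonal real tokens, yet naive pooling reports a similarity of 0.8 because both averages are pulled toward the shared padding vector, while masked pooling reports 0.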

@hanxiao thank you very much for your investigation and fix!

Sorry, I don't understand.
I have been trying to fine-tune the pretrained BERT base uncased model for a classification task.
The accuracy is good, but while debugging some misclassified cases,
I thought the key would be to check the vector representation that is fed into the classification layer.
So I stored the CLS token representations, since they are said to carry the sentence's meaning for a classification task.
I computed cosine similarities between the test case and the training samples.
For some it works, but for others the result is really unexpected.

You said, "And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance."
What did you really mean by that?

How else can one check whether the input was similar?
Thanks in advance for any help.

@chikubee How did you work with the embeddings and cosine similarity?
I am facing the same problem as you: I get better results with TF-IDF vectors than with BERT when using cosine similarity.

@shauryauppal BERT is a language model and was never really meant for sentence similarity tasks. You can try a BERT model fine-tuned for sentence similarity and use it as a sentence encoder if you have clean, reasonably long sentences. You can also try USE (Google) and Sentence Transformers (UKPLab): https://github.com/UKPLab/sentence-transformers.

@shauryauppal @chikubee Yes, but you can also use BERT for the STS task, with very good results when comparing generated text to the original text:
https://github.com/Tiiiger/bert_score

In my case, it seems that the BERT output ('pooled_output') can't represent the meaning of a sentence.
Almost all sentence embeddings are similar.
Maybe BERT should only be used in downstream tasks.

BERTScore works well. It is computationally very expensive, but the results are good.
Weighted word2vec also works in the same way.
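To give a rough idea of what BERTScore-style scoring does and why it costs more than comparing two pooled vectors, here is a simplified sketch of greedy token matching: every token vector in one sentence is matched to its most similar token vector in the other. It is not the bert_score implementation; it leaves out the BERT forward pass and the optional idf weighting, and the 2-dimensional token vectors are made up.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_match_scores(candidate, reference):
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine_sim(c, r) for r in reference)
                    for c in candidate) / len(candidate)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine_sim(r, c) for c in candidate)
                 for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up token vectors standing in for contextual embeddings.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate, reference)
print("P=%f R=%f F1=%f" % (precision, recall, f1))
```

The matching step compares every candidate token with every reference token, so its cost grows with the product of the two sentence lengths, on top of running the model to get the token vectors.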
