So I am wondering if it is possible to use the embeddings from BERT to perform unsupervised clustering. I am running into an error in the manner in which the embeddings are being read when I feed them into K means.
I have the following code:
import flair
import torch
from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
from flair.embeddings import BertEmbeddings
from flair.embeddings import ELMoEmbeddings
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
from flair.data import Sentence
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
bert_embedding = BertEmbeddings(bert_model_or_path='file path')
x = ...  # iterable of lines read from the input text file
BERT_corpus = []
line_count = 0
for row in x:
    sentence = Sentence(row)
    embedding = bert_embedding.embed(sentence)
    BERT_corpus.append(embedding)
    line_count += 1
num_clusters = 150
km = KMeans(n_clusters=num_clusters)
km.fit(BERT_corpus)
However, when I try to fit the K-means model to BERT_corpus, I receive the following error:
"ValueError: setting an array element with a sequence."
I see that my embeddings are stored in BERT_corpus as [[Sentence], [Sentence], ... , [Sentence]]. How can I best convert these into a vector format that can be fed into K-means?
Greatly appreciate any advice.
Hello @fareid-32 - do you wish to cluster words or sentences? Depending on which you want to cluster you need to either use word embeddings or sentence embeddings.
If you want to cluster words, you can get word embeddings as described in this tutorial.
If you want to cluster sentences (or documents), you can get embeddings as described in this tutorial.
Hope this helps!
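For the clustering step itself, both routes boil down to the same conversion: each sentence must become one fixed-length numeric row before KMeans can stack the list into a 2-D array. A minimal sketch of that conversion (flair is not assumed here; random vectors stand in for per-token BERT embeddings, and mean-pooling is just one simple way to get a sentence vector):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for per-token embeddings: after flair's embed(), each token
# carries a vector; here random 768-dim rows simulate BERT token vectors.
rng = np.random.default_rng(0)
sentences = [rng.normal(size=(n_tokens, 768)) for n_tokens in (5, 8, 3, 6)]

# Mean-pool each sentence's token vectors into one fixed-length row, so
# KMeans receives a plain 2-D array instead of a list of Sentence objects.
X = np.stack([tokens.mean(axis=0) for tokens in sentences])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# km.labels_ now holds one cluster id per sentence
```

The "setting an array element with a sequence" error above comes from skipping this step: appending what embed() returns gives a ragged list of objects, which NumPy cannot coerce into a rectangular array.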
Hey @fareid-32 . Did you have any luck? I'm gonna try to figure this out too.
Hi @BernierCR , yes I did manage to get this working.
The solution I took was to use @hanxiao's "BERT-as-service": https://github.com/hanxiao/bert-as-service
I found this embedding easy enough to feed into K means. See the sample below:
from bert_serving.client import BertClient
from sklearn.cluster import KMeans

bc = BertClient()
BERT_embedding = bc.encode(text_list)  # text_list: a list of raw sentences
km = KMeans(n_clusters=40)
km.fit(BERT_embedding)
clusters = km.labels_.tolist()
clusters = km.labels_.tolist()
Hopefully this helps you out
I do it this way.
X = [ sent.get_embedding().detach().numpy() for sent in dataset["sentence"]]
Basically, the detach().numpy() part is what makes the embeddings usable as input to other models.
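To spell that pattern out, here is a hedged sketch; plain torch tensors stand in for what sent.get_embedding() returns, since the point is only that a tensor on the autograd graph must be detach()ed before numpy():

```python
import torch
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for flair sentence embeddings: get_embedding() returns a torch
# tensor that may require grad, so calling .numpy() directly would fail.
embeddings = [torch.randn(768, requires_grad=True) for _ in range(10)]

# detach() drops the autograd graph; numpy() then yields a plain array,
# and stacking produces the (n_sentences, dim) matrix KMeans expects.
X = np.stack([e.detach().numpy() for e in embeddings])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```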
@BernierCR very cool - could you paste a minimal complete code example for, say, clustering? It may be good to add this to one of our tutorials.
@fareid-32 can you post the code for this, please? I am also struggling with this
@fareid-32 is the result of feeding BERT embeddings into K-means better than w2v?