So I am wondering if it is possible to use the embeddings from BERT to perform unsupervised clustering. I am running into an error in the manner in which the embeddings are being read when I feed them into K means.
I have the following code:
import flair
import torch
from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
from flair.embeddings import BertEmbeddings
from flair.embeddings import ELMoEmbeddings
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
from flair.data import Sentence
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
bert_embedding = BertEmbeddings(bert_model_or_path='file path')
x = ...  # iterable of lines read from the input text file
BERT_corpus = []
line_count = 0
for row in x:
    sentence = Sentence(row)
    embedding = bert_embedding.embed(sentence)
    BERT_corpus.append(embedding)
    line_count += 1
num_clusters = 150
km = KMeans(n_clusters=num_clusters)
km.fit(BERT_corpus)
However, when I try to fit the K-means model to BERT_corpus, I receive the following error:
"ValueError: setting an array element with a sequence."
I see that my embeddings are stored in BERT_corpus as [[Sentence], [Sentence], ... , [Sentence]]. How can I best convert these into a vector format that can be fed into K-means?
Greatly appreciate any advice.
Hello @fareid-32 - do you wish to cluster words or sentences? Depending on which you want to cluster you need to either use word embeddings or sentence embeddings.
If you want to cluster words, you can get word embeddings as described in this tutorial.
If you want to cluster sentences (or documents), you can get embeddings as described in this tutorial.
Hope this helps!
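For the clustering step itself, both routes boil down to the same conversion: each sentence must become one fixed-length numeric row before KMeans can stack the list into a 2-D array. A minimal sketch of that conversion (flair is not assumed here; random vectors stand in for per-token BERT embeddings, and mean-pooling is just one simple way to get a sentence vector):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for per-token embeddings: after flair's embed(), each token
# carries a vector; here random 768-dim rows simulate BERT token vectors.
rng = np.random.default_rng(0)
sentences = [rng.normal(size=(n_tokens, 768)) for n_tokens in (5, 8, 3, 6)]

# Mean-pool each sentence's token vectors into one fixed-length row, so
# KMeans receives a plain 2-D array instead of a list of Sentence objects.
X = np.stack([tokens.mean(axis=0) for tokens in sentences])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# km.labels_ now holds one cluster id per sentence
```

The "setting an array element with a sequence" error above comes from skipping this step: appending what embed() returns gives a ragged list of objects, which NumPy cannot coerce into a rectangular array.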
Hey @fareid-32 . Did you have any luck? I'm gonna try to figure this out too.
Hi @BernierCR , yes I did manage to get this working.
The solution I took was to use @hanxiao's "BERT-as-service": https://github.com/hanxiao/bert-as-service
I found this embedding easy enough to feed into K means. See the sample below:
from bert_serving.client import BertClient
from sklearn.cluster import KMeans

bc = BertClient()
BERT_embedding = bc.encode(text_list)  # text_list: a list of raw sentences
km = KMeans(n_clusters=40)
km.fit(BERT_embedding)
clusters = km.labels_.tolist()
clusters = km.labels_.tolist()
Hopefully this helps you out
I do it this way.
X = [ sent.get_embedding().detach().numpy() for sent in dataset["sentence"]]
Basically, the detach().numpy() part is what makes the embeddings usable as input to other models.
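To spell that pattern out, here is a hedged sketch; plain torch tensors stand in for what sent.get_embedding() returns, since the point is only that a tensor on the autograd graph must be detach()ed before numpy():

```python
import torch
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for flair sentence embeddings: get_embedding() returns a torch
# tensor that may require grad, so calling .numpy() directly would fail.
embeddings = [torch.randn(768, requires_grad=True) for _ in range(10)]

# detach() drops the autograd graph; numpy() then yields a plain array,
# and stacking produces the (n_sentences, dim) matrix KMeans expects.
X = np.stack([e.detach().numpy() for e in embeddings])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```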
@BernierCR very cool - could you paste a minimal complete code example for, say, clustering? It may be good to add this to one of our tutorials.
@fareid-32 can you post the code for this, please? I am also struggling with this
@fareid-32 is the result of feeding BERT embeddings into K-means better than w2v?