Flair: Unsupervised learning with flair

Created on 19 Dec 2019 · 6 comments · Source: flairNLP/flair

Hi all,

I've managed to build a supervised text classifier with Flair, based on my earlier simple clustering of document words (TF-IDF). Now I would like to do unsupervised clustering itself with Flair, but I don't know where to start.

I have my text divided/transformed into sentences and lemmas, all the words labeled as noun, verb, etc., and I have the words' syntactic dependency relations (the "head word, sub word" thing). I thought I could take the head word and the "next important" words to build "sentence cores" and cluster those. But I would love to do it with Flair.

Any thoughts, help, ideas?

With great respect,
Tare

question wontfix


All 6 comments

Hi,
As far as I know, to do the clustering task you need to embed the sentences first. To plot them on a map you must reduce the dimensionality down to x and y; you can use UMAP or another dimensionality-reduction method. Then plot the points using seaborn/matplotlib. After that, run a clustering algorithm (k-means, hierarchical, DBSCAN, etc.).

In simple words, Flair is used for the sentence-embedding part; for the other parts (dimensionality reduction, plotting, and clustering) you need other tools.
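The embed → reduce → plot → cluster pipeline described above could be sketched with scikit-learn like this (random vectors stand in for the sentence embeddings, and PCA stands in for UMAP to keep the dependencies minimal):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in for the embedding step: 20 "sentences" as 128-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 128))

# Dimensionality reduction to 2-D for plotting (UMAP or t-SNE work too).
points_2d = PCA(n_components=2).fit_transform(embeddings)

# Clustering on the full-dimensional vectors.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
```

points_2d gives the x/y coordinates for a scatter plot, and labels gives the cluster index to color each point by.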

Hope this could help

Hi,

Big thanks @nullphantom!

Maybe my problem is at a basic level. I have embedded sentences/documents, but how can I get the embedded data out to use outside Flair, e.g. for clustering?

What I've done is:

Tare

Hi @ristotar,

features like lemmas, syntactic dependency relations, TF-IDF weights, etc. are commonly used and well established in _classic_ machine learning. These are hand-crafted features: you need domain knowledge about the task to describe an instance well enough to classify it into one of your classes. As soon as you move to deep learning, however, you no longer need this feature engineering, because a neural net will find the necessary features itself. But you still have to represent your text in numeric form, i.e. each word as a vector in some high-dimensional semantic space. This is what a word embedding is for. As you have already found out, Flair provides a lot of different word embeddings, and it's up to you which one works best for your use case, but I would generally recommend a context-dependent word embedding like FlairEmbeddings. Flair also provides a class, StackedEmbeddings, with which you can combine a classic context-agnostic embedding like WordEmbeddings with any other.

If you use scikit-learn for clustering, say KMeans, you need a matrix where each row represents one instance of your dataset, i.e. one sentence, and each column one dimension of the embedding space. Check out the tutorial about document embeddings to see how to represent one sentence as one vector. For example, you could do it like this:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

document_embeddings = DocumentPoolEmbeddings([WordEmbeddings("fi"),
                                              FlairEmbeddings("fi-forward"),
                                              FlairEmbeddings("fi-backward")])

sentences = [Sentence("A Finnish sentence."), Sentence("Another one.")]

matrix = []
for sentence in sentences:
    document_embeddings.embed(sentence)
    # get_embedding() returns a torch.Tensor; convert it to NumPy for scikit-learn
    matrix.append(sentence.get_embedding().detach().cpu().numpy())

matrix is something you could pass directly to the fit() method of a sklearn.cluster.KMeans object. The next step could be dimensionality reduction with UMAP or t-SNE, as @nullphantom said. Once you have reduced your matrix to two dimensions, you could plot each sentence, assigning each point a color according to its cluster label and a specific shape according to the text classification you did before.
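That clustering-plus-projection step might look like this (a sketch with random vectors standing in for the Flair embedding matrix, and t-SNE from scikit-learn in place of UMAP):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for the Flair embedding matrix: 30 sentences, 100 dimensions.
rng = np.random.default_rng(42)
matrix = rng.normal(size=(30, 100))

# Cluster in the full embedding space.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(matrix)

# Project to 2-D for plotting; perplexity must be < number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(matrix)

# coords[:, 0], coords[:, 1] and labels can then go straight into
# matplotlib's plt.scatter(x, y, c=labels).
```

Note that clustering is done on the full-dimensional embeddings; the 2-D t-SNE coordinates are only for visualization, since t-SNE distorts distances.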

Good luck!

Hi,

Sorry it took me some time, but I tried this today, and: thank you soooo much, @severinsimmler. This is so easy and so fast; it helped me forward quite a bit.

The mystery of embeddings is unfolding for me, bit by bit...

Tare

Glad I could help :)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
