Hi all,
I've managed to build a supervised text classifier with flair, based on my earlier simple clustering of document words (TF-IDF). I would now like to do unsupervised clustering itself with flair but don't know where to start.
I have my text divided/transformed into sentences and lemmas, all the words labeled (noun, verb, etc.), and I have the words' syntactic dependency relations (the "head word, sub word" thing). I thought I could take the head word and the "next important" words to build "sentence cores" and cluster those. But I would love to do it with flair.
Any thoughts, help, ideas?
With great respect,
Tare
Hi,
as far as I know, to do the clustering task you need to embed the sentences first. To plot them on a map you must reduce the dimensionality to x and y; you can use UMAP or another dimensionality-reduction method. Then plot it using seaborn/matplotlib. After that, run a clustering algorithm (k-means, hierarchical, DBSCAN, etc.).
In simple words, flair is used for the sentence-embedding part; for the other parts (dimensionality reduction, plotting, and clustering) you need other tools.
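The pipeline above can be sketched end to end. This is a minimal sketch with assumptions: random vectors stand in for the flair sentence embeddings, and PCA stands in for UMAP (which lives in the separate umap-learn package); the shapes and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# stand-in for flair sentence embeddings: 20 "sentences", 2048 dimensions
embeddings = rng.normal(size=(20, 2048))

# 1. reduce the high-dimensional embeddings to x/y coordinates for plotting
coords = PCA(n_components=2).fit_transform(embeddings)

# 2. cluster the full-dimensional embeddings (k-means here;
#    DBSCAN or hierarchical clustering plug in the same way)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# coords is what you would hand to matplotlib/seaborn for the scatter plot,
# coloring each point by its entry in labels
print(coords.shape, labels.shape)
```

Note that the clustering runs on the full-dimensional embeddings; the 2D coordinates are only for visualization.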
Hope this could help
Hi,
Big thanks @nullphantom!
Maybe my problem is on a basic level. I have embedded sentences/documents, but how can I get the embedded data out of flair to use it elsewhere, e.g. for clustering?
What I've done is:
Tare
Hi @ristotar,
features like lemmas, syntactic dependency relations, TF-IDF weights etc. are commonly used and well established in _classic_ machine learning; these are hand-crafted features, and you need domain knowledge about the task to describe an instance so it can be properly classified into one of your classes. However, as soon as you start with deep learning, you no longer need to do this feature engineering, because a neural net will find the necessary features itself. But you still have to represent your text in numeric form, i.e. one word as a vector in some high-dimensional semantic space. This is what a word embedding is for. As you have already found out, Flair provides a lot of different word embeddings; it's up to you which one works best for your use case, but I would generally recommend a context-dependent word embedding like the FlairEmbeddings. Flair also provides a class, StackedEmbeddings, with which you can combine a classic context-agnostic embedding like WordEmbeddings with any other.
If you use scikit-learn for clustering, say KMeans, you need a matrix where each row represents one instance of your dataset, i.e. one sentence, and each column one dimension of the embedding space. Check out the tutorial about document embeddings to represent one sentence as one vector. For example, you could do it like this:
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

document_embeddings = DocumentPoolEmbeddings([WordEmbeddings("fi"),
                                              FlairEmbeddings("fi-forward"),
                                              FlairEmbeddings("fi-backward")])

sentences = [Sentence("A Finnish sentence."), Sentence("Another one.")]

matrix = list()
for sentence in sentences:
    # compute the document embedding for this sentence
    document_embeddings.embed(sentence)
    # get the vector and convert the torch tensor to numpy for scikit-learn
    embedding = sentence.get_embedding().detach().numpy()
    matrix.append(embedding)
matrix is something you could pass directly to the fit() method of a sklearn.cluster.KMeans object. The next step could be dimension reduction with UMAP or t-SNE, as @nullphantom said. Once you have reduced your matrix to two dimensions, you could plot each sentence as a point, assigning each point a color according to its cluster label and a marker shape according to the text classification you did before.
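The plotting step described above could look roughly like this. A minimal sketch under assumptions: a random matrix stands in for the flair embedding matrix, the class labels from the earlier classification are made up, and t-SNE from scikit-learn is used for the reduction (UMAP would slot in the same way).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
matrix = rng.normal(size=(60, 2048))   # stand-in for the embedding matrix
classes = rng.integers(0, 2, size=60)  # stand-in for earlier classification labels

# cluster in the full embedding space, then reduce to 2D for plotting
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(matrix)
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(matrix)

# one marker shape per class, colored by cluster label
for cls, marker in zip((0, 1), ("o", "s")):
    mask = classes == cls
    plt.scatter(coords[mask, 0], coords[mask, 1],
                c=cluster_labels[mask], marker=marker, cmap="tab10")
plt.savefig("clusters.png")
```

Comparing marker shapes (your supervised classes) against colors (the unsupervised clusters) is a quick visual check of how well the two agree.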
Good luck!
Hi,
Sorry it took me some time, but I did this today, and: thank you soooo much, @severinsimmler. This is so easy and so fast; it helped me forward quite a bit.
Mystery of embeddings is unfolding for me, bit by bit...
Tare
Glad I could help :)