fastText: How to use pre-trained word representations for classification?

Created on 5 Sep 2016 · 11 comments · Source: facebookresearch/fastText

The documentation is very brief, and I'm wondering whether we can use pre-trained word representations for classification. The reason I need this is that I have a large unlabeled dataset but only a small labeled dataset. I want to train word vectors on the large unlabeled dataset and then train the classification model on the small labeled dataset.

ljie-PI

👍7

All 11 comments

As of now, it is not possible to use pre-trained word representations. However, we are working on adding this functionality to fastText, and it should be available soon. Stay tuned!

EdouardGrave on 6 Sep 2016

👍11

Hi Edouard,

Is there a timeline for this feature?

ljie-PI on 14 Sep 2016

Hi, I see that you added the -pretrainedVectors parameter. How is it supposed to be used?

rsteca on 10 Oct 2016

Great to have this feature!

ljie-PI on 11 Oct 2016

@rsteca @ljie-PI @EdouardGrave do you have an example use case?

Here is what I tried: I have a large unlabeled dataset of about 575M lines, all cleaned. I trained a vector file on it using skipgram with negative sampling (ns).

I also have a test dataset with 100,000 lines and 25 labels. When I used the previously generated pre-trained vector file to build a supervised classification model for this dataset, accuracy did not improve. It is the same as when I don't use the pre-trained vectors and train on the 100,000-line, 25-label dataset alone.

Is there anything I am missing here?

spate141 on 17 Oct 2016

I think 100,000 labeled samples is already a large dataset.

In my case, I have only 30,000 labeled samples but about 200,000 unlabeled samples, so I think pre-trained vectors may help. However, I am working on other tasks and have not tested this yet.

Without this option, though, we could not even test the idea.

ljie-PI on 18 Oct 2016

@ljie-PI
How did you get the pre-trained word vectors from the unlabeled samples? And have you verified your idea?

I have recently run into a similar problem. I only have 2,000 labeled samples across about 20 classes, but more than a million unlabeled samples.

fucusy on 10 Nov 2016

@rsteca, @fucusy: The -pretrainedVectors option is used to specify a text file containing pre-trained word vectors (e.g. the .vec file output by ./fasttext skipgram).

@spate141: when training supervised models with relatively large training sets (such as yours), the use of pre-trained word vectors does not necessarily lead to better performance. You can try to reduce the number of epochs (-epoch) or the learning rate (-lr). We are also working on new methods to improve the use of pre-trained vectors for supervised classification.

EdouardGrave on 16 Nov 2016
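A minimal sketch of the two-step workflow described in the reply above, assuming a fasttext binary built from the repository. The file names (unlabeled.txt, train.txt, vectors), the FASTTEXT variable, and the function name train_with_pretrained are illustrative assumptions, not taken from the thread. The vector dimension is passed to both steps because the -dim used for supervised training has to match the dimension of the pre-trained vectors.

```shell
# Sketch only: file names and the function name are illustrative,
# not taken from the thread. Point FASTTEXT at your fasttext binary.
FASTTEXT="${FASTTEXT:-./fasttext}"

# Usage: train_with_pretrained UNLABELED LABELED DIM [extra supervised options...]
train_with_pretrained() {
    unlabeled=$1
    labeled=$2
    dim=$3
    shift 3

    # Step 1: learn word vectors from the unlabeled corpus.
    # This writes vectors.bin and vectors.vec.
    "$FASTTEXT" skipgram -input "$unlabeled" -output vectors -dim "$dim" || return 1

    # Step 2: train the classifier, starting from the vectors in vectors.vec.
    # -dim must match the dimension of the pre-trained vectors.
    "$FASTTEXT" supervised -input "$labeled" -output cls.model \
        -dim "$dim" -pretrainedVectors vectors.vec "$@"
}

# Example call (commented out so the block has no side effects):
# train_with_pretrained unlabeled.txt train.txt 100 -epoch 5 -lr 0.05
```

Any trailing arguments are forwarded to the supervised step, so options such as -epoch and -lr can be tuned there without editing the function.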

@EdouardGrave If I understand correctly, I should increase the number of epochs and reduce the learning rate, right? I have already done this, and after also experimenting with the dataset and the model creation step, I got the accuracy I needed. Anyway, thanks for the reply!

spate141 on 16 Nov 2016

@EdouardGrave
My situation is very similar: I want to train word vectors on unlabeled data and use them when training a classification model, because my limited labeled data is not enough to learn good word vectors during training.
I noticed that the current fastText distribution has added the option '-pretrainedVectors' to support pre-trained word vectors.
What confuses me is that when I use this option, fastText still outputs a new .vec text file when training the classification model.
What is the difference between the pre-trained vectors and the new word vectors output by supervised training?
Does the supervised classification model make use of the pre-trained vectors?
My command looks like this:
fasttext supervised -input train.data -pretrainedVectors -output cls.model
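As stated earlier in the thread, -pretrainedVectors takes a vectors file (such as a .vec file) as its value. A .vec file written by fastText starts with a header line holding the vocabulary size and the vector dimension, and the -dim given to supervised training needs to match that dimension. The following is a small sketch for reading the dimension from the header; the helper name vec_dim and the file name vectors.vec are hypothetical.

```shell
# Print the vector dimension recorded in the header line of a .vec file.
# The header has the form "<vocabulary size> <dimension>".
vec_dim() {
    head -n 1 "$1" | awk '{ print $2 }'
}

# Example use (commented out; vectors.vec is a hypothetical file name):
# fasttext supervised -input train.data -output cls.model \
#     -dim "$(vec_dim vectors.vec)" -pretrainedVectors vectors.vec
```

Reading the dimension from the file avoids hard-coding a value that may differ from the one used when the vectors were trained.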