Fasttext: How to use pre-trained word representations for classification?

Created on 5 Sep 2016 · 11 Comments · Source: facebookresearch/fastText

The documentation is brief, and I'm wondering whether pre-trained word representations can be used for classification. The reason I need this is that I have a large unlabeled dataset but only a small labeled one. I want to train word vectors on the large unlabeled dataset and then train the classification model on the small labeled dataset.

All 11 comments

As of now, it is not possible to use pre-trained word representations. However, we are working on adding this functionality to fastText, and it should be available soon. Stay tuned!

Hi Edouard,

Is there a timeline for this feature?

Hi, I see that you added the -pretrainedVectors parameter. How is it supposed to be used?

Great to have this feature!

@rsteca @ljie-PI @EdouardGrave do you have an example use case?

Here is what I tried. I have a large unlabeled dataset of about 575M lines, all cleaned, and I generated a vector file from it using skipgram with negative sampling (ns).

I have another test dataset with 100,000 lines and 25 labels. When I used the previously generated pre-trained vector file to train a supervised classification model on this dataset, accuracy did not improve. It is the same as when I train on the 100,000-line dataset without pre-trained vectors.

Is there anything I am missing here?

I think 100,000 labeled samples is already a large dataset.

In my case, I have only 30,000 labeled samples but about 200,000 unlabeled samples, so I think pre-trained vectors may help. However, I am working on other tasks and have not yet tested this.

Still, without this option we could not even try to test the idea.

@ljie-PI
How did you get pre-trained word vectors from the unlabeled samples, and have you confirmed your idea?

I recently ran into a similar problem. I have only 2,000 labeled samples across about 20 classes, but more than a million unlabeled samples.

@rsteca, @fucusy: The -pretrainedVectors option is used to specify a text file containing pre-trained word vectors (e.g. the .vec file output by ./fasttext skipgram).
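The workflow described above could be sketched roughly as follows. This is only an illustration, not an official recipe: the file names (`unlabeled.txt`, `train.data`, `vectors`, `cls_model`) and the helper functions are hypothetical, and the sketch assumes the `.vec` file starts with a header line of the form `<vocab size> <dim>`. As far as I know, the dimension of the pre-trained vectors has to match the `-dim` value of the supervised run, so the sketch passes the same `-dim` to both commands. It only builds and prints the two command lines; nothing is executed.

```python
import shlex


def read_vec_header(vec_path):
    """Return (vocab_size, dim) read from the first line of a .vec text file."""
    with open(vec_path, encoding="utf-8") as handle:
        vocab_size, dim = handle.readline().split()
    return int(vocab_size), int(dim)


def build_commands(unlabeled, labeled, vec_prefix="vectors",
                   model_prefix="cls_model", dim=100,
                   fasttext_bin="./fasttext"):
    """Build the skipgram and supervised command lines (hypothetical paths).

    Nothing is run here; the lists can be passed to subprocess.run later.
    """
    # Step 1: learn word vectors from the unlabeled corpus.
    skipgram = [fasttext_bin, "skipgram", "-input", unlabeled,
                "-output", vec_prefix, "-dim", str(dim)]
    # Step 2: train the classifier, pointing -pretrainedVectors at the
    # .vec file written by step 1 and keeping -dim identical.
    supervised = [fasttext_bin, "supervised", "-input", labeled,
                  "-output", model_prefix, "-dim", str(dim),
                  "-pretrainedVectors", vec_prefix + ".vec"]
    return skipgram, supervised


if __name__ == "__main__":
    for command in build_commands("unlabeled.txt", "train.data"):
        print(shlex.join(command))
```

Once the skipgram step has produced the `.vec` file, `read_vec_header("vectors.vec")` can be used to confirm that its dimension agrees with the `-dim` value before starting the supervised run.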

@spate141: when training supervised models with relatively large training sets (such as yours), the use of pre-trained word vectors does not necessarily lead to better performance. You can try to reduce the number of epochs (-epoch) or the learning rate (-lr). We are also working on new methods to improve the use of pre-trained vectors for supervised classification.

@EdouardGrave If I understand correctly, I should increase the number of epochs and reduce the learning rate, right? I had already done this, and after further experimenting with the dataset and the model creation step I got the required accuracy. Anyway, thanks for the reply!

@EdouardGrave
My case is very similar. I want to use word vectors pre-trained on unlabeled data when training a classification model, because my limited labeled data is not enough to learn good word vectors during training.
I noticed that the current fastText distribution has added the option '-pretrainedVectors' to support pre-trained word vectors.
What confuses me is that when I use it, training the classification model still outputs a new .vec text file.
What is the difference between the pre-trained vectors and the word vectors output by supervised training?
Does the supervised classification model make use of the pre-trained vectors?
My command looks like this:
fasttext supervised -input train.data -pretrainedVectors -output cls.model
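Note that the command as quoted gives no file name after -pretrainedVectors, while the option expects the path to a .vec file. One way to see empirically how the vectors written by the supervised run differ from the pre-trained ones is to compare the two files word by word. The following is a minimal sketch, assuming both files use the .vec text format (a header line `<vocab size> <dim>`, then one word per line followed by its values); the function names and file paths are hypothetical. A cosine similarity close to 1 for a word means its vector direction barely changed during supervised training.

```python
import math


def load_vec(path):
    """Load a .vec text file into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        _, dim = handle.readline().split()  # header: "<vocab size> <dim>"
        dim = int(dim)
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip lines that do not match the declared dimension
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compare(pretrained_path, supervised_path):
    """Return {word: cosine similarity} for words present in both files."""
    before = load_vec(pretrained_path)
    after = load_vec(supervised_path)
    shared = sorted(set(before) & set(after))
    return {word: cosine(before[word], after[word]) for word in shared}
```

For example, `compare("vectors.vec", "cls.model.vec")` (hypothetical file names) would return one similarity score per word that appears in both files.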

Bookmarking this. I have the same problem as the commenter above.
