The documentation is very brief, and I'm wondering whether we can use pre-trained word representations for classification. The reason is that I have a large unlabeled dataset but only a small labeled one. I want to train word vectors on the large unlabeled dataset and then train the classification model on the small labeled dataset.
As of now, it is not possible to use pre-trained word representations. However, we are working on adding this functionality to fastText, and it should be available soon. Stay tuned!
Hi Edouard,
Is there a timeline for this feature?
Hi, I see that you added the -pretrainedVectors parameter. How is it supposed to be used?
Great to have this feature!
@rsteca @ljie-PI @EdouardGrave do you have an example use case?
Here is what I tried. I have a large unlabeled dataset of about 575M lines, all cleaned, and I generated a vector file from it using skipgram with negative sampling.
I have another test dataset with 100,000 lines and 25 labels. When I used the previously generated pre-trained vector file to train a supervised classification model on this dataset, accuracy did not improve. It is the same as when I don't use pre-trained vectors and train only on the 100,000-line dataset with 25 labels.
Is there anything I am missing here?
I think 100,000 labeled samples is already a large dataset.
In my case, I have only 30,000 labeled samples but about 200,000 unlabeled samples, so I think pre-trained vectors may help. However, I am working on other tasks and have not tried to verify this yet.
Still, without this option we could not even test the idea.
@ljie-PI
How did you get pre-trained word vectors from the unlabeled samples, and have you verified your idea?
I recently ran into a similar problem. I have only 2,000 labeled samples across about 20 classes, but more than a million unlabeled samples.
@rsteca, @fucusy: The -pretrainedVectors option is used to specify a text file containing pre-trained word vectors (e.g. the .vec file output by ./fasttext skipgram).
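For readers looking for a concrete invocation, here is a minimal sketch of the two-step workflow described above. The file names (unlabeled.txt, train.data, vectors, model) and the FASTTEXT variable are assumptions for illustration, not taken from this thread. The sketch also sets -dim from the header of the .vec file, since, as far as I know, fastText expects the dimension of the pre-trained vectors to match -dim.

```shell
#!/bin/sh
# Sketch only: the FASTTEXT path and all file names are assumptions.
FASTTEXT=${FASTTEXT:-./fasttext}

# Read the vector dimension from the header line of a .vec file.
# The header has the form "<vocab_size> <dim>".
vec_dim() {
    head -n 1 "$1" | awk '{print $2}'
}

# Step 1: learn word vectors from the unlabeled corpus.
# Writes <prefix>.bin and <prefix>.vec.
train_vectors() {
    "$FASTTEXT" skipgram -input "$1" -output "$2"
}

# Step 2: train the classifier with the .vec file passed via -pretrainedVectors.
# -dim is taken from the .vec header so that both dimensions agree.
train_classifier() {
    "$FASTTEXT" supervised -input "$1" -output "$2" \
        -dim "$(vec_dim "$3")" -pretrainedVectors "$3"
}

# Example usage (hypothetical file names):
#   train_vectors unlabeled.txt vectors
#   train_classifier train.data model vectors.vec
```

Nothing runs until the functions are called, so the snippet can be sourced and adapted to your own paths.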
@spate141: when training supervised models with relatively large training sets (such as yours), the use of pre-trained word vectors does not necessarily lead to better performance. You can try to reduce the number of epochs (-epoch) or the learning rate (-lr). We are also working on new methods to improve the use of pre-trained vectors for supervised classification.
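One way to act on the suggestion above is a small sweep over -epoch and -lr. The sketch below only prints the commands rather than running them; the file names (train.data, vectors.vec), the output prefixes, and the chosen values are assumptions for illustration, picked to sit at or below what I believe are the supervised defaults (-epoch 5, -lr 0.1).

```shell
#!/bin/sh
# Sketch only: file names and the value grids are assumptions.
# Prints one fastText command per (epoch, lr) pair; pipe to sh to run them.
sweep_commands() {
    for epoch in 1 2 5; do
        for lr in 0.01 0.05 0.1; do
            echo "./fasttext supervised -input train.data" \
                 "-output model_e${epoch}_lr${lr}" \
                 "-pretrainedVectors vectors.vec" \
                 "-epoch ${epoch} -lr ${lr}"
        done
    done
}

# Example usage:
#   sweep_commands          # inspect the commands
#   sweep_commands | sh     # actually train the nine models
```

Each resulting model can then be compared on a held-out file with ./fasttext test, alongside a baseline trained without -pretrainedVectors.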
@EdouardGrave If I understand correctly, I should increase the number of epochs and reduce the learning rate, right? I already did this, and after also experimenting with the dataset and the model creation step, I reached the required accuracy. Anyway, thanks for the reply!
@EdouardGrave
My case is very similar. I want to use word vectors pre-trained on unlabeled data when training a classification model, because my labeled data is too limited to learn good word vectors during training.
I noticed that the current fasttext distribution has added the option '-pretrainedVectors' to support pre-trained word vectors.
But what confuses me is that when I use this option, training the classification model still outputs a new .vec text file.
What is the difference between the pre-trained vectors and the word vectors output by supervised training?
Does the supervised classification model make use of the pre-trained vectors?
My command looks like this:
fasttext supervised -input train.data -pretrainedVectors
Following. I have the same problem as the comment above.