Hi,
I have trained an unsupervised model on German Wikipedia text. Now I want to continue training it on my task-specific texts (also unsupervised) to fine-tune the embeddings to my domain.
I know that there is a -retrain parameter to fastText but I do not know how to use it. Maybe someone could help...
Thanks
Philip
Hi @PhilipMay,
Unfortunately, it is not currently possible to fine-tune word vectors using unsupervised data. We are working on this problem, but do not have a satisfying solution yet.
Please note that the -retrain parameter is used when compressing supervised models (more precisely, to fine-tune the compressed model). The -pretrainedVectors command line option can be used to initialize word vectors for learning supervised models, which can be seen as a form of (supervised) fine-tuning. Unfortunately, the same technique does not work well for unsupervised models.
Best,
Edouard.
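For readers who want to try the supervised route described above, here is a minimal sketch using the Python bindings. It assumes the pretrained vectors are stored in fastText's text format (a .vec file whose first line is "<num_words> <dim>") and that dim has to match that header. The helper names read_vec_dim and train_supervised_from_pretrained are made up for illustration and are not part of fastText.

```python
def read_vec_dim(vec_path):
    """Return the vector dimension declared in the header of a .vec file.

    Assumes fastText's text vector format, whose first line is
    "<num_words> <dim>".
    """
    with open(vec_path, encoding="utf-8") as handle:
        header = handle.readline().split()
    if len(header) != 2:
        raise ValueError(f"unexpected .vec header in {vec_path}: {header!r}")
    return int(header[1])


def train_supervised_from_pretrained(train_path, vec_path, **kwargs):
    """Train a supervised model whose word vectors start from vec_path.

    Hypothetical convenience wrapper; do not pass dim in kwargs, it is
    taken from the .vec header.
    """
    import fasttext  # imported lazily so read_vec_dim works without it

    return fasttext.train_supervised(
        input=str(train_path),
        pretrainedVectors=str(vec_path),
        dim=read_vec_dim(vec_path),  # must match the pretrained vectors
        **kwargs,
    )
```

A call might look like train_supervised_from_pretrained("train.txt", "wiki.de.vec", epoch=10), where both file names are placeholders and train.txt contains __label__-prefixed lines. This only covers the supervised case; as noted above, it does not give you unsupervised fine-tuning.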
Hi @EdouardGrave, I guess that you have considered #423.
It would be very interesting for me to know why it was discarded.
It would also be very helpful if this information were in the code, since the Python wrapper does not warn you about this. I spent this afternoon implementing DESM with fastText, and now I understand why I never got a full output matrix.
def train_unsupervised(
    input,
    model="skipgram",
    lr=0.05,
    dim=100,
    ws=5,
    epoch=5,
    minCount=5,
    minCountLabel=0,
    minn=3,
    maxn=6,
    neg=5,
    wordNgrams=1,
    loss="ns",
    bucket=2000000,
    thread=multiprocessing.cpu_count() - 1,
    lrUpdateRate=100,
    t=1e-4,
    label="__label__",
    verbose=2,
    pretrainedVectors="",
):
    """
    Train an unsupervised model and return a model object.
    input must be a filepath. The input text does not need to be tokenized
    as per the tokenize function, but it must be preprocessed and encoded
    as UTF-8. You might want to consult standard preprocessing scripts such
    as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
    The input field must not contain any labels or use the specified label prefix
    unless it is ok for those words to be ignored. For an example consult the
    dataset pulled by the example script word-vector-example.sh, which is
    part of the fastText repository.
    """
Hi @EdouardGrave, do you know if the -pretrainedVectors command line option will be available for unsupervised models any time soon?
Hi @EdouardGrave, does your reply here and in #499 mean that -wordNgrams and -pretrainedVectors are not implemented for unsupervised models, or that they do not work well?