fastText: How does unsupervised retraining work?

Created on 5 Jan 2019 · 5 Comments · Source: facebookresearch/fastText

Hi,
I have trained an unsupervised model on German Wikipedia text. Now I want to retrain it on my task-specific texts (also unsupervised) to fine-tune the embeddings to my domain.

  • Is this possible?
  • How is this done?

I know that there is a -retrain parameter to fastText but I do not know how to use it. Maybe someone could help...

Thanks
Philip

All 5 comments

Hi @PhilipMay,

Unfortunately, it is not possible at the moment to fine-tune word vectors using unsupervised data. We are working on this problem, but do not have a satisfactory solution yet.

Please note that the -retrain parameter is used when compressing supervised models (more precisely, to fine-tune the compressed model). The -pretrainedVectors command line option can be used to initialize word vectors when learning supervised models, which can be seen as a form of (supervised) fine-tuning. Unfortunately, the same technique does not work well for unsupervised models.

Best,
Edouard.
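
To make the two options from the reply above concrete, here is a minimal sketch that only assembles the corresponding fastText command lines. The file names (`domain.train.txt`, `wiki.de.vec`, `domain_model`), the helper function names, and the default cutoff value are hypothetical; the assumptions are that a `fasttext` binary is on your PATH, that the training file is labeled for supervised learning, and that `-dim` matches the dimension of the `.vec` file.

```python
def supervised_cmd(train_file, output_prefix, vec_file, dim, binary="fasttext"):
    """Command line for a supervised model initialized from pretrained vectors."""
    return [
        binary, "supervised",
        "-input", train_file,
        "-output", output_prefix,
        "-pretrainedVectors", vec_file,  # text .vec file, not the .bin model
        "-dim", str(dim),                # should match the .vec dimension
    ]


def quantize_cmd(train_file, model_prefix, cutoff=100000, binary="fasttext"):
    """Command line for compressing a supervised model and fine-tuning it."""
    return [
        binary, "quantize",
        "-input", train_file,
        "-output", model_prefix,
        "-cutoff", str(cutoff),  # hypothetical value; keeps the top features
        "-retrain",              # fine-tune the embeddings after the cutoff
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    train = supervised_cmd("domain.train.txt", "domain_model", "wiki.de.vec", 300)
    compress = quantize_cmd("domain.train.txt", "domain_model")
    print(" ".join(train))
    print(" ".join(compress))
```

Neither command fine-tunes an unsupervised model; both apply to supervised training only, as the reply explains. The lists can be passed to `subprocess.run` once the files exist.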

Hi @EdouardGrave, I guess that you have considered #423.

It would be very interesting for me to know why it was discarded.

Also, it would be very helpful if this information were also in the code, since the Python wrapper does not warn you about this. I spent this afternoon implementing DESM with fastText, and now I understand why I never got a full output matrix.

def train_unsupervised(
    input,
    model="skipgram",
    lr=0.05,
    dim=100,
    ws=5,
    epoch=5,
    minCount=5,
    minCountLabel=0,
    minn=3,
    maxn=6,
    neg=5,
    wordNgrams=1,
    loss="ns",
    bucket=2000000,
    thread=multiprocessing.cpu_count() - 1,
    lrUpdateRate=100,
    t=1e-4,
    label="__label__",
    verbose=2,
    pretrainedVectors="",
):
    """
    Train an unsupervised model and return a model object.
    input must be a filepath. The input text does not need to be tokenized
    as per the tokenize function, but it must be preprocessed and encoded
    as UTF-8. You might want to consult standard preprocessing scripts such
    as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
    The input field must not contain any labels or use the specified label prefix
    unless it is ok for those words to be ignored. For an example consult the
    dataset pulled by the example script word-vector-example.sh, which is
    part of the fastText repository.
    """

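For readers wondering why the output matrix matters here, the following is a minimal sketch of an IN-OUT similarity score. It assumes that "DESM" in the comment above means the Dual Embedding Space Model, which compares input-side query vectors against the centroid of output-side document vectors. The function name and the toy vectors are hypothetical; in practice the vectors would come from a trained model, for example via `get_word_vector` and `get_output_matrix` in the Python wrapper, which is itself an assumption about how the commenter set things up.

```python
import math


def _norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def desm_in_out(query_in_vecs, doc_out_vecs):
    """Average cosine between each query IN vector and the document centroid.

    The centroid is the mean of the length-normalized OUT vectors of the
    document's words.
    """
    dim = len(doc_out_vecs[0])
    units = []
    for vec in doc_out_vecs:
        n = _norm(vec)
        units.append([x / n for x in vec] if n > 0 else list(vec))
    centroid = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    c_norm = _norm(centroid)

    total = 0.0
    for q in query_in_vecs:
        q_norm = _norm(q)
        if q_norm == 0 or c_norm == 0:
            continue  # a zero vector contributes no similarity
        dot = sum(a * b for a, b in zip(q, centroid))
        total += dot / (q_norm * c_norm)
    return total / len(query_in_vecs)


if __name__ == "__main__":
    # Toy 2-dimensional vectors, for illustration only.
    query = [[1.0, 0.0]]
    on_topic = [[2.0, 0.0], [1.0, 0.1]]
    off_topic = [[0.0, 3.0]]
    print(desm_in_out(query, on_topic) > desm_in_out(query, off_topic))
```

The score depends on the OUT vectors having been trained together with the IN vectors, which is why starting from pretrained input vectors alone does not give a usable output matrix.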
Hi @EdouardGrave, do you know if the -pretrainedVectors command line option will be available for unsupervised models any time soon?

Hi @EdouardGrave, does your reply here and in #499 mean that -wordNgrams and -pretrainedVectors are not implemented for unsupervised models, or that they do not work well?
