Gensim: FastText save & callbacks suspicious behavior

Created on 18 Oct 2018 · 7 Comments · Source: RaRe-Technologies/gensim

Description

The FastText model does not learn anything from the text corpus.

Steps/Code/Corpus to Reproduce

import os
import logging

from gensim.models import FastText
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch and show training parameters '''

    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch))
        model.save(savepath)
        print(
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch ... ", sep="\n"
            )
        if os.path.isfile(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1))):
            print("Previous model deleted ")
            os.remove(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1)))
        self.epoch += 1

class SentenceIter:
    def __iter__(self):
        with open("data/eng_tweets/20_news_groups_dataset.txt", "r") as f:
            for line in f:
                yield line[:-1].split(" ")

if __name__ == "__main__":

    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO
    )

    num_workers = os.cpu_count()
    model = FastText(
        SentenceIter(),
        sg=1,
        size=100,
        window=3,
        min_count=5,
        workers=num_workers,
        iter=5,
        negative=20,
        callbacks=[EpochSaver("./checkpoints/fasttext_eng_tweets")]
    )

Expected Results

I expected model.most_similar("word") to return words close in meaning, but it returns only garbage.
I used an open-source dataset, fetch_20newsgroups from sklearn.datasets.

Actual Results

The results change only slightly from epoch to epoch: the order of the returned words or their similarity scores may shift a little, but nothing substantial changes during training. The model learns nothing.

Also, importantly:

1. If I train a fastText model from the command line with
./fasttext skipgram -input data.txt -output model (https://github.com/facebookresearch/fastText), the results are good. For example, for apple we get apples, apple's, and so on.
2. If I switch my model from FastText to Word2Vec, it learns, and the results are good.
3. If I don't use my EpochSaver but instead save and load the model manually on each epoch, for example:

for epoch in range(N_epochs):
    train model 
    save model

If I then load the model before the next epoch starts, I also get good results.

So the problem may be in EpochSaver, but could you please explain why it works with Word2Vec but not here?

Versions

Linux-4.15.0-24-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1

bug difficulty medium fasttext

Source

daridar

👍3

All 7 comments

Same thing for me. When I trained my wiki corpus with word2vec, I got 37% on the analogy questions. But when I trained the same corpus with fasttext, the result was 3.3% on the same analogy questions. Is there a problem in fasttext?

Gensim version: 3.6.0
Python Version: 3.6.4
Windows 10

bunyamink on 23 Nov 2018

Thanks for the report, @daridar. Point (3) especially makes me think that we have an issue with the save method (i.e., saving somehow changes the current model).

menshikh-iv on 14 Dec 2018

CC @mpenkov

menshikh-iv on 14 Dec 2018

👍1

I encountered the same problem while trying to train a FastText model on a big dataset.

Here is a simplified version of the problem.

from gensim.models.fasttext import FastText
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
import numpy as np
from tqdm import tqdm
from time import sleep


class list_iter:
    def __init__(self, array, model, see=np.nan, only_one_loop=False):
        self.array = array
        self.see = see  # print and save every `see` items; NaN disables this
        self.model = model
        self.only_one_loop = only_one_loop
        self.tqdm_bar = tqdm(desc='iterations')

    def __iter__(self):
        while True:
            for item in self.array:
                self.tqdm_bar.update(1)

                if self.tqdm_bar.n % self.see == 0:
                    # Hash the current vector of the word 'I' to see whether it changes.
                    print('\nvector hash:' + str(hash(self.model['I'].tostring())))
                    sleep(2)
                    self.model.wv.save("model")
                yield item
            if self.only_one_loop:
                self.tqdm_bar.close()
                break


with open("tinyshakespeare.txt", 'r') as fp:
    corpus = [i.split() for i in fp.read().split('\n')]

model = FastText(workers=1)

# A single pass over the corpus to build the vocabulary.
model.build_vocab(list_iter(corpus, model, only_one_loop=True))

model.train(list_iter(corpus, model, see=10000), total_examples=99999999999999999, epochs=10)

The output of running this code is:

iterations: 40001it [00:00, 359624.54it/s]
iterations: 8178it [00:00, 81174.89it/s]
vector hash:-4933655588363529352
iterations: 17619it [00:02, 5406.74it/s]
vector hash:-4933655588363529352
iterations: 28266it [00:04, 4400.15it/s]
vector hash:-4933655588363529352
iterations: 39611it [00:06, 4396.16it/s]
vector hash:-4933655588363529352

The vector of the word does not change, and the model does not learn anything.
If I replace FastText(workers=1) with Word2Vec(workers=1), everything works and makes sense, and the vector is updated:

iterations: 40001it [00:00, 359474.28it/s]
iterations: 0it [00:00, ?it/s]
vector hash:-3094244126925185959
iterations: 19618it [00:02, 6651.19it/s]
vector hash:1153644772814581057
iterations: 22603it [00:04, 3228.06it/s]
vector hash:5947032563406220642
iterations: 30001it [00:06, 3326.54it/s]
vector hash:-7484819002721531784

By the way, you can use any text file.
I don't think the problem comes from the save method, because even without saving, the vector is the same after each iteration. When I check the hash of the saved file, it is different each time I save during training, but for some reason I can't see any changes in the vectors.
The issue is still there even when I go back to gensim 3.1.0.
Why is that?
gensim==3.6.0
python==3.6.4
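
One possible explanation, consistent with the adjust_vectors() workaround reported later in this thread, is that the vector returned by a word lookup is a cached combination of trainable parts, and that training updates only those parts. The toy class below is a minimal sketch of that idea and is not gensim's implementation: the names vectors_vocab, vectors_ngrams and adjust_vectors are borrowed for readability, and the plain averaging rule is an assumption. It shows how a fingerprint of the looked-up vector can stay constant while the underlying parameters move.

```python
import hashlib
import struct


class ToySubwordVectors:
    """Toy stand-in for a subword-aware vector store (NOT gensim's class).

    Assumption: the vector seen by lookups is a cached average of the word's
    own trainable vector and the trainable vectors of its character n-grams.
    """

    def __init__(self, word_vec, ngram_vecs):
        self.vectors_vocab = list(word_vec)                   # updated by training
        self.vectors_ngrams = [list(v) for v in ngram_vecs]   # updated by training
        self.vector = [0.0] * len(word_vec)                   # cached, read by lookups
        self.adjust_vectors()

    def adjust_vectors(self):
        """Recompute the cached vector from the trainable parts."""
        parts = [self.vectors_vocab] + self.vectors_ngrams
        self.vector = [sum(column) / len(parts) for column in zip(*parts)]


def fingerprint(vec):
    """Return a stable digest of a float vector, to spot changes between checks."""
    return hashlib.sha256(struct.pack("%dd" % len(vec), *vec)).hexdigest()


wv = ToySubwordVectors([1.0, 1.0], [[0.0, 2.0], [2.0, 0.0]])
before = fingerprint(wv.vector)

# Simulate a training step: only the trainable parts move.
wv.vectors_vocab[0] += 3.0
wv.vectors_ngrams[1][1] += 3.0
stale = fingerprint(wv.vector)   # the cached vector has not been refreshed

wv.adjust_vectors()
fresh = fingerprint(wv.vector)   # now reflects the simulated update

print(before == stale, before == fresh)  # True False
```

If the real model behaves like this toy, an unchanged hash during training would not by itself prove that nothing is being learned, only that the looked-up vectors have not been recomputed yet.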

HashimHL on 17 Jan 2019

I think I have fixed my problem. It looks like the gensim team is working on a fix, but it has not yet been released in the pip version.
This is what I did:
pip3 uninstall gensim
Then I reinstalled it from this commit:
pip3 install 'git+git://github.com/RaRe-Technologies/gensim.git@b452a5b59f2f474dbbd275d0838c45df4d3c5aac'
Then, before saving the model, I call this function:

self.model.wv.adjust_vectors()
self.model.wv.save("model")

This solution is for my case. If you let the train function finish, there is no need to call model.wv.adjust_vectors() yourself, since train calls self.wv.adjust_vectors() at the end:

        super(FastText, self).train(
            sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
            epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
            queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)
        self.wv.adjust_vectors()
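
The same workaround could in principle be applied to the per-epoch callback from the original report. The following is a sketch only, not verified against a real gensim training run: the class name AdjustingEpochSaver and the prefix parameter are hypothetical, and it assumes model.wv exposes adjust_vectors() as in the commit mentioned above. When that method is missing, the refresh step is skipped. The ImportError fallback exists only so the sketch can be exercised without gensim installed.

```python
import os

try:
    from gensim.models.callbacks import CallbackAny2Vec
except ImportError:
    # Fallback so the sketch can be exercised without gensim installed.
    CallbackAny2Vec = object


class AdjustingEpochSaver(CallbackAny2Vec):
    """Save a checkpoint after each epoch, refreshing the word vectors first."""

    def __init__(self, savedir, prefix="fasttext_epoch"):
        self.savedir = savedir
        self.prefix = prefix   # hypothetical file-name prefix
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def _path(self, epoch):
        return os.path.join(self.savedir, "{}{}.gz".format(self.prefix, epoch))

    def on_epoch_end(self, model):
        # Refresh the looked-up vectors before saving, if the method exists
        # (assumption: it is available on model.wv for FastText models).
        adjust = getattr(model.wv, "adjust_vectors", None)
        if callable(adjust):
            adjust()

        model.save(self._path(self.epoch))

        # Keep only the most recent checkpoint on disk.
        previous = self._path(self.epoch - 1)
        if os.path.isfile(previous):
            os.remove(previous)
        self.epoch += 1
```

It would be passed to the model the same way as the original callback, for example callbacks=[AdjustingEpochSaver("./checkpoints/fasttext_eng_tweets")].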

HashimHL on 17 Jan 2019

🎉1