TODO: FastText model does not learn anything from the text corpus.
import os
import logging

from gensim.models import FastText
from gensim.models.callbacks import CallbackAny2Vec


class EpochSaver(CallbackAny2Vec):
    '''Callback to save the model after each epoch and delete the previous checkpoint.'''

    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch))
        model.save(savepath)
        print(
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch ... ", sep="\n"
        )
        # Keep only the latest checkpoint on disk.
        if os.path.isfile(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1))):
            print("Previous model deleted ")
            os.remove(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1)))
        self.epoch += 1


class SentenceIter:
    def __iter__(self):
        with open("data/eng_tweets/20_news_groups_dataset.txt", "r") as f:
            for line in f:
                # Drop the trailing newline and split on single spaces.
                yield line[:-1].split(" ")


if __name__ == "__main__":
    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO
    )
    num_workers = os.cpu_count()
    model = FastText(
        SentenceIter(),
        sg=1,
        size=100,
        window=3,
        min_count=5,
        workers=num_workers,
        iter=5,
        negative=20,
        callbacks=[EpochSaver("./checkpoints/fasttext_eng_tweets")]
    )
I expected model.most_similar("word") to return words that are close in meaning, but it returns garbage.
I used an open-source dataset from sklearn.datasets: fetch_20newsgroups.

The results change only very slightly from epoch to epoch: the order of the returned words or their similarity scores may shift a little, but nothing meaningful changes during training. The model does not learn.
Also, importantly:
1. If I train a fastText model from the command line (https://github.com/facebookresearch/fastText) with
./fasttext skipgram -input data.txt -output model
it shows good results. For example, for "apple" I get: apples, apple's and so on.
2. If I change my model from FastText to Word2Vec, it learns and the results are good.
3. If I don't use my EpochSaver, but instead save and reload the model manually on each epoch, for example:
for epoch in range(N_epochs):
    train model
    save model
and then load the model before the next epoch starts, I also get good results.
So the problem may be in EpochSaver, but can you please explain why it works in Word2Vec's case but not here?
Linux-4.15.0-24-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1
Same thing for me. When I trained on my wiki corpus with word2vec, I got 37% on the analogy questions. But when I trained on the same corpus with fasttext, the result was 3.3% on the same analogy questions. Is there a problem in fasttext?
Gensim version: 3.6.0
Python Version: 3.6.4
Windows 10
Thanks for the report @daridar. Point (3) especially makes me think that we have an issue with the save method (i.e. it somehow changes the current model).
CC @mpenkov
I have encountered the same problem while trying to train a FastText model on a big dataset.
Here is a simplified version of the problem.
from gensim.models.fasttext import FastText
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
import numpy as np
from tqdm import tqdm
from time import sleep


class list_iter:
    def __init__(self, array, model, see=np.nan, only_one_loop=False):
        self.array = array
        self.see = see
        self.model = model
        self.only_one_loop = only_one_loop
        self.tqdm_bar = tqdm(desc='iterations')

    def __iter__(self):
        while True:
            for item in self.array:
                self.tqdm_bar.update(1)
                # Every `see` items, print a hash of one word vector and save the vectors.
                if self.tqdm_bar.n % self.see == 0:
                    print('\nvector hash:' + str(hash(self.model['I'].tostring())))
                    sleep(2)
                    self.model.wv.save("model")
                yield item
            if self.only_one_loop:
                self.tqdm_bar.close()
                break


with open("tinyshakespeare.txt", 'r') as fp:
    corpus = [i.split() for i in fp.read().split('\n')]

model = FastText(workers=1)
model.build_vocab(list_iter(corpus, model, only_one_loop=True))
model.train(list_iter(corpus, model, see=10000), total_examples=99999999999999999, epochs=10)
The output of running this code is:
iterations: 40001it [00:00, 359624.54it/s]
iterations: 8178it [00:00, 81174.89it/s]
vector hash:-4933655588363529352
iterations: 17619it [00:02, 5406.74it/s]
vector hash:-4933655588363529352
iterations: 28266it [00:04, 4400.15it/s]
vector hash:-4933655588363529352
iterations: 39611it [00:06, 4396.16it/s]
vector hash:-4933655588363529352
The vector of the word is not changing, so the model is not learning anything.
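The check above fingerprints a vector by passing its raw bytes to Python's built-in hash(). A minimal sketch of the same idea with hashlib follows. The helper name `vector_fingerprint` is hypothetical, and the sketch assumes the vector supports the buffer protocol (as NumPy arrays do). One reason to prefer a hashlib digest is that hash() of bytes is salted per process by default, so its values cannot be compared between separate runs.

```python
import hashlib
from array import array


def vector_fingerprint(vec):
    """Return a short hex digest of the raw bytes of `vec`.

    Assumption: `vec` supports the buffer protocol (array.array,
    bytes, or a NumPy array).
    """
    raw = memoryview(vec).tobytes()
    return hashlib.sha1(raw).hexdigest()[:16]


# Stand-ins for a word vector; with gensim this would be e.g. model.wv['I'].
before = array("f", [0.1, 0.2, 0.3])
after_no_update = array("f", [0.1, 0.2, 0.3])
after_update = array("f", [0.1, 0.25, 0.3])

print(vector_fingerprint(before) == vector_fingerprint(after_no_update))  # True
print(vector_fingerprint(before) == vector_fingerprint(after_update))     # False
```

If the fingerprint of a frequent word stays identical across many training batches, the vector being read is not being updated.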
If I replace FastText(workers=1) with Word2Vec(workers=1), everything works fine and makes sense, and the vector is updated:
iterations: 40001it [00:00, 359474.28it/s]
iterations: 0it [00:00, ?it/s]
vector hash:-3094244126925185959
iterations: 19618it [00:02, 6651.19it/s]
vector hash:1153644772814581057
iterations: 22603it [00:04, 3228.06it/s]
vector hash:5947032563406220642
iterations: 30001it [00:06, 3326.54it/s]
vector hash:-7484819002721531784
By the way, you can use any text file.
I don't think the problem comes from the save method, because even without saving, the vector is the same after each iteration. When I check the hash of the saved file, it is different each time I save during training, but for some reason I can't see any changes to the vectors.
Even when I went back to gensim 3.1.0, the issue was still there.
Why is that?
gensim==3.6.0
python==3.6.4
I think I have fixed my problem. It looks like the gensim team is working on solving it, but the fix is not released yet in the pip version?!
This is what I have done:
pip3 uninstall gensim
Then reinstall it from this commit:
pip3 install 'git+git://github.com/RaRe-Technologies/gensim.git@b452a5b59f2f474dbbd275d0838c45df4d3c5aac'
Then, before I save the model, I call this function:
self.model.wv.adjust_vectors()
self.model.wv.save("model")
This workaround is for my case, where I save the vectors in the middle of training. If you only use the model after train() has finished, there is no need to call model.wv.adjust_vectors() yourself, since train() calls it at the end by itself:
super(FastText, self).train(
    sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
    epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
    queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)
self.wv.adjust_vectors()
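Why the adjust_vectors() call matters before a mid-training save can be illustrated with a toy model. This is a simplified sketch, not gensim's implementation: the class `ToySubwordVectors` and its attribute names are made up, and it assumes that a word's final vector is the average of its whole-word vector and its character n-gram vectors, kept in a separate cached table that training does not touch.

```python
class ToySubwordVectors:
    """Toy stand-in for subword-based word vectors with a cached lookup table."""

    def __init__(self, word_vec, ngram_vecs):
        self.word_vec = list(word_vec)                   # trained parameter
        self.ngram_vecs = [list(v) for v in ngram_vecs]  # trained parameters
        self.cached = None                               # what lookups and saves see
        self.adjust_vectors()

    def adjust_vectors(self):
        """Recompute the cached vector as the average of all parts."""
        parts = [self.word_vec] + self.ngram_vecs
        self.cached = [sum(column) / len(parts) for column in zip(*parts)]

    def train_step(self, delta):
        """Pretend training: shift every trained parameter by `delta`."""
        self.word_vec = [x + delta for x in self.word_vec]
        self.ngram_vecs = [[x + delta for x in v] for v in self.ngram_vecs]


toy = ToySubwordVectors(word_vec=[1.0, 2.0], ngram_vecs=[[3.0, 4.0], [5.0, 6.0]])
snapshot = list(toy.cached)

toy.train_step(0.5)
stale = toy.cached == snapshot      # cache untouched by training

toy.adjust_vectors()
refreshed = toy.cached != snapshot  # cache now reflects the update

print(stale, refreshed)  # True True
```

In this toy, anything that reads or saves `cached` between `train_step` and `adjust_vectors` sees the old values, which matches the unchanged vector hash reported above.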
I think that I have fixed my problem, it looks like gensim team is working on solving it but it's not released yet in the pip version?!
Yes, exactly. Big thanks to @mpenkov, who helped us a lot with fasttext-related issues in #2313.
I guess I can close this issue as fixed by #2313.
CC: @mpenkov