Gensim: Error on Word2vec training on tweets

Created on 19 Apr 2017  ·  16 comments  ·  Source: RaRe-Technologies/gensim

Hi,

I am trying to train word2vec embeddings on tweets. I defined the sentence generator as follows:

import codecs
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))  # cached as a set: one lookup per token

# `tknz` (a tokenizer instance) and `tokenize` (a text-normalising helper)
# are defined elsewhere in my code.
def tokenize_tweets():
    for line in codecs.open('../data/sample_tweets.txt', encoding='utf-8'):
        tweet_text = ' '.join(token for token in tknz.tokenize(line) if token not in STOPWORDS)
        try:
            mod_text = tokenize(tweet_text)
            tokens = tknz.tokenize(mod_text)
            if len(tokens) > 0:
                yield tokens  # reuse the tokens computed above instead of re-tokenizing
            else:
                yield ['NULL']
        except UnicodeEncodeError:
            yield ['<NULL>']
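A side note on the generator above (separate from the training error reported below): gensim iterates over the corpus several times, once for `build_vocab()` and once per training epoch, but a plain generator is exhausted after its first pass. A minimal sketch of a restartable wrapper, using toy data in place of `tokenize_tweets` (all names here are illustrative, not from the original code):

```python
class RestartableCorpus:
    """Calls the wrapped generator function anew on every iteration,
    so the corpus can be scanned more than once."""
    def __init__(self, gen_func):
        self.gen_func = gen_func

    def __iter__(self):
        return self.gen_func()

def toy_tweets():  # stand-in for tokenize_tweets
    yield ['hello', 'world']
    yield ['another', 'tweet']

sentences = RestartableCorpus(toy_tweets)
first_pass = list(sentences)
second_pass = list(sentences)  # a bare generator would already be empty here
```

Both passes see the full corpus, which is what `build_vocab` followed by `train` requires.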


Vocabulary building from this generator runs fine, but when I try running the train method, I get the following error:

ValueError: You must specify either total_examples or total_words, for proper alpha and progress calculations. The usual value is total_examples=model.corpus_count.

Not sure what is wrong with it.

All 16 comments

@shashankg7 Which version of Gensim are you using? As of the latest release, you need to explicitly pass an epochs parameter and an estimate of the corpus size when calling the train function. So you would pass these parameters like: vec_model.train(sentences, total_examples=self.corpus_count, epochs=self.iter). You can read about this here.

@chinmayapancholi13 I see. It was a silly mistake then. Thanks for your help!

@shashankg7 No problem! :) Let me know if you face any other problems. I'd be happy to help.

After writing this line:

word2vec_model.train(sentences, total_examples=self.corpus_count, epochs=self.iter)

I am getting the following error:

NameError: name 'self' is not defined

Please give me a solution.

@kumargouravdas Typically, in Python, self refers to the current instance of an object and is only defined inside a class's methods. Please remove self and check that all the variables exist:

model.train(sentences,total_examples=model.corpus_count,epochs=model.iter)

I'm trying to train a word2vec model using
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
but I'm getting a ValueError:
ValueError: You must specify an explict epochs count. The usual value is epochs=model.epochs.

@abhishek021 Can you show what the values of model.corpus_count and model.epochs are?

@menshikh-iv corpus_count = 128868 and epochs = 5
Thanks for your reply, but I'm no longer facing this issue; I just restarted the notebook and the error is gone. But now I'm facing another issue, with tsne.fit_transform(all_word_vector_matrix):
this dimensionality reduction process consumes my whole disk as well as my RAM, and the system crashes.

model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)

epochs=model.iter
total_examples=model.corpus_count

@abhishek021 That should definitely work; can you share your model, code, and corpus? (I'll try to reproduce your error.)

model = word2vec.Word2Vec(sg=0, workers=3, size=300, min_count=3,
                          window=4, hs=1, negative=0)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)

if not os.path.exists("trained"):
    os.makedirs("trained")

model.save(os.path.join("trained", "model.w2v"))
model = word2vec.Word2Vec.load(os.path.join("trained", "model.w2v"))
most_similar_word = model.wv.most_similar_cosmul(word)
# most_similar_word = model.wv.most_similar(word)
word_vector = model[word]

@menshikh-iv Now I'm facing an issue with the tsne.fit_transform() dimensionality reduction process; please see my reply above.

@abhishek021 TSNE is from scikit-learn (not Gensim); possible solutions:

  • use a sample of the word vectors (since you don't have enough RAM)
  • use a simpler method like PCA
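A sketch combining both suggestions, sampling the vectors and using PCA instead of t-SNE, with random data standing in for the real word-vector matrix (the array and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data standing in for model.wv vectors: 10,000 words x 300 dimensions.
rng = np.random.default_rng(0)
all_word_vector_matrix = rng.standard_normal((10_000, 300)).astype(np.float32)

# 1) Work on a random sample of the rows to cut memory usage.
sample_idx = rng.choice(len(all_word_vector_matrix), size=1_000, replace=False)
sample = all_word_vector_matrix[sample_idx]

# 2) PCA is far cheaper than t-SNE; project to 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform(sample)
```

A common middle ground is to use PCA to reduce to a few dozen dimensions first and then run t-SNE on that smaller matrix, which keeps t-SNE's memory and time costs manageable.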

@Savaimaheshwari What do you mean? Do you have a question?

@menshikh-iv Thanks for the information.
I think 8 GB of RAM is not enough for my dataset, so I reduced the corpus size and it's working now.

