Gensim: Error on Word2vec training on tweets

Created on 19 Apr 2017  ·  16 comments  ·  Source: RaRe-Technologies/gensim

Hi,

I am trying to train word2vec embeddings on tweets. I defined the sentence generator as follows:

import codecs
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))  # cached as a set: one lookup per token

# `tknz` (a tokenizer instance) and `tokenize` (a text-normalising helper)
# are defined elsewhere in my code.
def tokenize_tweets():
    for line in codecs.open('../data/sample_tweets.txt', encoding='utf-8'):
        tweet_text = ' '.join(token for token in tknz.tokenize(line) if token not in STOPWORDS)
        try:
            mod_text = tokenize(tweet_text)
            tokens = tknz.tokenize(mod_text)
            if len(tokens) > 0:
                yield tokens  # reuse the tokens computed above instead of re-tokenizing
            else:
                yield ['NULL']
        except UnicodeEncodeError:
            yield ['<NULL>']
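A side note on the generator above (separate from the training error reported below): gensim iterates over the corpus several times, once for `build_vocab()` and once per training epoch, but a plain generator is exhausted after its first pass. A minimal sketch of a restartable wrapper, using toy data in place of `tokenize_tweets` (all names here are illustrative, not from the original code):

```python
class RestartableCorpus:
    """Calls the wrapped generator function anew on every iteration,
    so the corpus can be scanned more than once."""
    def __init__(self, gen_func):
        self.gen_func = gen_func

    def __iter__(self):
        return self.gen_func()

def toy_tweets():  # stand-in for tokenize_tweets
    yield ['hello', 'world']
    yield ['another', 'tweet']

sentences = RestartableCorpus(toy_tweets)
first_pass = list(sentences)
second_pass = list(sentences)  # a bare generator would already be empty here
```

Both passes see the full corpus, which is what `build_vocab` followed by `train` requires.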


Vocabulary building from this generator runs fine, but when I try running the train method, I get the following error:

ValueError: You must specify either total_examples or total_words, for proper alpha and progress calculations. The usual value is total_examples=model.corpus_count.

Not sure what is wrong with it.

All 16 comments

@shashankg7 Which version of Gensim are you using? As of the latest release, you need to explicitly pass an epochs parameter and an estimate of the corpus size when calling the train function. So you would pass these parameters like: vec_model.train(sentences, total_examples=self.corpus_count, epochs=self.iter). You can read about this here.

@chinmayapancholi13 I see. It was a silly mistake then. Thanks for your help!

@shashankg7 No problem! :) Let me know if you face any other problems. I'd be happy to help.

After writing this line:

word2vec_model.train(sentences, total_examples=self.corpus_count, epochs=self.iter)

I am getting the following error:

NameError: name 'self' is not defined

Please give me a solution.

@kumargouravdas Typically, in Python, self refers to the current instance of an object and is only defined inside a class's methods. Please remove self and check that all the variables exist:

model.train(sentences,total_examples=model.corpus_count,epochs=model.iter)

I'm trying to train a word2vec model using
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
but I'm getting a ValueError:
ValueError: You must specify an explict epochs count. The usual value is epochs=model.epochs.

@abhishek021 Can you show what the values of model.corpus_count and model.epochs are?

@menshikh-iv corpus_count = 128868 and epochs = 5
Thanks for your reply, but I'm no longer facing this issue; I just restarted the notebook and the error is gone. But now I'm facing another issue, with tsne.fit_transform(all_word_vector_matrix):
this dimensionality reduction process consumes my whole disk as well as my RAM, and the system crashes.

model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)

epochs=model.iter
total_examples=model.corpus_count

@abhishek021 That should definitely work; can you share your model, code, and corpus? (I'll try to reproduce your error.)

model = word2vec.Word2Vec(sg=0, workers=3, size=300, min_count=3,
                          window=4, hs=1, negative=0)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)

if not os.path.exists("trained"):
    os.makedirs("trained")

model.save(os.path.join("trained", "model.w2v"))
model = word2vec.Word2Vec.load(os.path.join("trained", "model.w2v"))
most_similar_word = model.wv.most_similar_cosmul(word)
# most_similar_word = model.wv.most_similar(word)
word_vector = model[word]

@menshikh-iv Now I'm facing an issue with the tsne.fit_transform() dimensionality reduction process; please see my reply above.

@abhishek021 TSNE is from scikit-learn (not Gensim); possible solutions:

  • use a sample of the word vectors (since you don't have enough RAM)
  • use a simpler method like PCA
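A sketch combining both suggestions, sampling the vectors and using PCA instead of t-SNE, with random data standing in for the real word-vector matrix (the array and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data standing in for model.wv vectors: 10,000 words x 300 dimensions.
rng = np.random.default_rng(0)
all_word_vector_matrix = rng.standard_normal((10_000, 300)).astype(np.float32)

# 1) Work on a random sample of the rows to cut memory usage.
sample_idx = rng.choice(len(all_word_vector_matrix), size=1_000, replace=False)
sample = all_word_vector_matrix[sample_idx]

# 2) PCA is far cheaper than t-SNE; project to 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform(sample)
```

A common middle ground is to use PCA to reduce to a few dozen dimensions first and then run t-SNE on that smaller matrix, which keeps t-SNE's memory and time costs manageable.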

@Savaimaheshwari What do you mean? Do you have a question?

@menshikh-iv Thanks for the information.
I think 8 GB of RAM is not enough for my dataset, so I reduced the corpus size and it's working now.

