Gensim: Warning if vocabulary is entirely single character elements in doc2vec

Created on 5 May 2016  路  16Comments  路  Source: RaRe-Technologies/gensim

Many users on the mailing list pass strings to doc2vec instead of lists.
It results in poor results as vocab is single chars.

A warning will help detect that.

difficulty easy wishlist

Most helpful comment

I'll take this up :+1:

All 16 comments

I'll take this up :+1:

Yes! The test could be words is a true string type, and/or that the vocab keys are all single-characters, and/or that the vocab is "very small" (threshold TBD).

Other sanity checks based on frequent issues are also possible. A few ideas:

  • model very large compared to single-pass training-data size (risks overfitting)
  • alpha raised again after descending to min_alpha (may indicate confusion about training cycles)
  • calling load() on an instance rather than the class (often indicates not saving reference to return value)
  • mismatch in expected count for train() and actual (may indicate skipped steps when doing build_vocab() and train() separately)

What about just checking the type of the argument passed ? I can imagine at the beginning of train_document_* functions something like:

if isinstance(doc_words, str) or isinstance(doc_words, unicode):
    logger.warn("You are passing strings instead of list of ids for the training of your doc2vec model. Expect to have a vocabulary composed of one-char words.")

Just had a small doubt here: Should I be making an __init__(self, *args, **kwargs) for TaggedDocument to raise those warnings? Right I'll go ahead with what @bloody76 suggested.

Left to implement:

  • [ ] Raise warning if calling load() on an instance rather than the class.
  • [ ] Raise warning if mismatch in expected count for train() and actual.

calling load() on an instance rather than the class

Working on fixing this but I'm not fully certain of how one might go about "calling load() on instance".
Is there a previously known issue/error-report that I could refer to?

@dust0x Class methods and instance methods are different in python, so you can raise warning on one of them.

@tmylk yep, I see that. Could something go wrong when calling load() on instance?

What kind of warning should one be looking for here? A call to load by instance seems to be working fine for me. Am I missing something here?

Do you see the difference in behaviour of the two methods?

Even though calling-on-an-instance might superficially succeed, often the user doing so doesn't realize they'll be getting a newly-loaded instance back as the return value. (Rather, their mental model is that a load is happening _into_ the supplied instance.) To make this clear, and emphasize the intended/correct way to load, it'd be best if load-on-instance throws an error with helpful message that must be corrected.

Could it be easier then to not inherit from utils.SaveLoad?

Sharing save/load utilities via callouts/composition, rather than inheritance, _might_ be a better approach. But, that'd be a pretty big change at this point (and contrast with practices elsewhere in gensim). So, it could require a lot more thought/refactoring and perhaps synchronization with a major-revision-number increment.

For Raise warning if mismatch in expected count for train() and actual., correct me if I am wrong.
In word2vec.py

if total_words is None and total_examples is None:
            if self.corpus_count:
                total_examples = self.corpus_count
                logger.info("expecting %i sentences, matching count from corpus used for vocabulary survey", total_examples)
            else:
                raise ValueError("you must provide either total_words or total_examples, to enable alpha and progress calculations")

We have to add an else and do the checks.
Or please guide me through.
Thanks.

ping @gojomo @tmylk

I believe both the check-mark to-dos added by @tmylk when re-opening this issue have or are already being addressed elsewhere:

warn if load()-on-instance: #889

warn if mismatch in provided/expected example counts: https://github.com/RaRe-Technologies/gensim/commit/aa8a9cd7902b411cfa2b3cb0f82707103ba03ff2

Since the issue-in-the-title (warn-if-single-characters) is fixed, I'd consider this issue closed, and the other ideas spun out to elsewhere.

Closing as resolved.

Was this page helpful?
0 / 5 - 0 ratings