Many users on the mailing list pass strings to doc2vec instead of lists.
It results in poor results as vocab is single chars.
A warning will help detect that.
I'll take this up :+1:
Yes! The test could be words is a true string type, and/or that the vocab keys are all single-characters, and/or that the vocab is "very small" (threshold TBD).
Other sanity checks based on frequent issues are also possible. A few ideas:
alpha raised again after descending to min_alpha (may indicate confusion about training cycles)load() on an instance rather than the class (often indicates not saving reference to return value)train() and actual (may indicate skipped steps when doing build_vocab() and train() separately)What about just checking the type of the argument passed ? I can imagine at the beginning of train_document_* functions something like:
if isinstance(doc_words, str) or isinstance(doc_words, unicode):
logger.warn("You are passing strings instead of list of ids for the training of your doc2vec model. Expect to have a vocabulary composed of one-char words.")
Just had a small doubt here: Should I be making an Right I'll go ahead with what @bloody76 suggested.__init__(self, *args, **kwargs) for TaggedDocument to raise those warnings?
Left to implement:
calling
load()on an instance rather than the class
Working on fixing this but I'm not fully certain of how one might go about "calling load() on instance".
Is there a previously known issue/error-report that I could refer to?
@dust0x Class methods and instance methods are different in python, so you can raise warning on one of them.
@tmylk yep, I see that. Could something go wrong when calling load() on instance?
What kind of warning should one be looking for here? A call to load by instance seems to be working fine for me. Am I missing something here?
Do you see the difference in behaviour of the two methods?
Even though calling-on-an-instance might superficially succeed, often the user doing so doesn't realize they'll be getting a newly-loaded instance back as the return value. (Rather, their mental model is that a load is happening _into_ the supplied instance.) To make this clear, and emphasize the intended/correct way to load, it'd be best if load-on-instance throws an error with helpful message that must be corrected.
Could it be easier then to not inherit from utils.SaveLoad?
Sharing save/load utilities via callouts/composition, rather than inheritance, _might_ be a better approach. But, that'd be a pretty big change at this point (and contrast with practices elsewhere in gensim). So, it could require a lot more thought/refactoring and perhaps synchronization with a major-revision-number increment.
For Raise warning if mismatch in expected count for train() and actual., correct me if I am wrong.
In word2vec.py
if total_words is None and total_examples is None:
if self.corpus_count:
total_examples = self.corpus_count
logger.info("expecting %i sentences, matching count from corpus used for vocabulary survey", total_examples)
else:
raise ValueError("you must provide either total_words or total_examples, to enable alpha and progress calculations")
We have to add an else and do the checks.
Or please guide me through.
Thanks.
ping @gojomo @tmylk
I believe both the check-mark to-dos added by @tmylk when re-opening this issue have or are already being addressed elsewhere:
warn if load()-on-instance: #889
warn if mismatch in provided/expected example counts: https://github.com/RaRe-Technologies/gensim/commit/aa8a9cd7902b411cfa2b3cb0f82707103ba03ff2
Since the issue-in-the-title (warn-if-single-characters) is fixed, I'd consider this issue closed, and the other ideas spun out to elsewhere.
Closing as resolved.
Most helpful comment
I'll take this up :+1: