Gensim: D2VTransformer.fit_transform doesn't work

Created on 9 Jan 2018 · 4 Comments · Source: RaRe-Technologies/gensim

The X parameter of D2VTransformer.fit_transform does not accept either input format I tried: a list of token lists raises _AttributeError: 'list' object has no attribute 'words'_, and a list of TaggedDocument raises _TypeError: sequence item 0: expected str instance, list found_.

Example:

from gensim.sklearn_api import D2VTransformer
from gensim.models import doc2vec

class_dict = {'mathematics': 1, 'physics': 0}
train_data = [
    (['calculus', 'mathematical'], 'mathematics'), (['geometry', 'operations', 'curves'], 'mathematics'),
    (['natural', 'nuclear'], 'physics'), (['science', 'electromagnetism', 'natural'], 'physics')
]
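# The same documents as TaggedDocument objects, each tagged with its position.
d2v_sentences = [doc2vec.TaggedDocument(words[0], [i]) for i, words in enumerate(train_data)]
# Plain token lists and integer class labels.
train_input = list(map(lambda x: x[0], train_data))
train_target = list(map(lambda x: class_dict[x[1]], train_data))

model = D2VTransformer(min_count=1)
# Raises AttributeError: 'list' object has no attribute 'words'
model.fit_transform(train_input, train_target)
# Raises TypeError: sequence item 0: expected str instance, list found
#model.fit_transform(d2v_sentences, train_target)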

Versions:
Windows-10-10.0.16299-SP0
Python 3.6.4 | packaged by conda-forge | (default, Dec 24 2017, 10:11:43) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.3
SciPy 0.19.1
gensim 3.2.0
FAST_VERSION 1

Labels: bug, difficulty easy

All 4 comments

Thanks for the report, @volj1!
@chinmayapancholi13 can you look into this? This looks really suspicious.

@menshikh-iv Sure, Ivan. I'll try to look into the reason for this behavior.

Hi, the problem is as follows.
The train_input format does not work with the fit method in gensim's sklearn_api (d2vmodel.py), because fit trains a Doc2Vec model, which accepts only the TaggedDocument format. A TaggedDocument has a '_.words_' attribute, so it does not raise the _AttributeError_; a plain list of tokens does.

The d2v_sentences format does not work with the transform method in d2vmodel.py: its docstring says it expects a list of token lists (which train_input satisfies), not TaggedDocument objects.

In the documentation of the sklearn API for the Doc2Vec model, the "fit" and "transform" methods are shown being called separately, each with its own input format. @volj1, however, called "fit_transform", which scikit-learn implements in a single line, i.e.: return self.fit(X, y, **fit_params).transform(X). The same X is therefore passed to both methods, and neither input format satisfies both.
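
To make the mismatch concrete, here is a minimal sketch that runs without gensim. ToyD2V and Tagged are hypothetical stand-ins, not gensim classes; they mimic only the input expectations described above (fit reads a .words attribute, transform works on plain token lists), and fit_transform chains the two the same way as the scikit-learn one-liner. It is an illustration of the failure mode, not gensim's actual implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged document (words + tags); not gensim's class.
Tagged = namedtuple("Tagged", ["words", "tags"])


class ToyD2V:
    """Toy transformer that mimics only the input expectations discussed here."""

    def fit(self, X, y=None):
        # Like Doc2Vec training, this reads the `.words` attribute of each item.
        self.vocab_ = sorted({w for doc in X for w in doc.words})
        return self

    def transform(self, X):
        # Expects plain token lists; joining fails if an item is not a list of str.
        return [" ".join(doc) for doc in X]

    def fit_transform(self, X, y=None):
        # Same chaining as scikit-learn: one X is passed to both steps.
        return self.fit(X, y).transform(X)


token_lists = [["calculus", "mathematical"], ["natural", "nuclear"]]
tagged = [Tagged(words, [i]) for i, words in enumerate(token_lists)]


def try_fit_transform(X):
    """Return 'ok' or the name of the exception raised by fit_transform."""
    try:
        ToyD2V().fit_transform(X)
        return "ok"
    except (AttributeError, TypeError) as exc:
        return type(exc).__name__


# Token lists fail in fit, tagged documents fail in transform.
errors = (try_fit_transform(token_lists), try_fit_transform(tagged))

# Calling the two steps separately, each with its own format, works.
separate = ToyD2V().fit(tagged).transform(token_lists)
```

In this toy version, calling fit and transform separately with their respective formats succeeds, which matches how the documentation is described as using them.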

One approach would be to convert the input to TaggedDocument format internally in the fit method, so that the user only has to supply the train_input format. I have opened a WIP PR with that approach here
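
A rough sketch of what such an internal conversion might look like is below. The helper name ensure_tagged is hypothetical, and TaggedDocument is redefined here as a namedtuple stand-in so the snippet runs without gensim; this is not the code from the PR, only an illustration of the idea under those assumptions.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument so the sketch runs without gensim.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def ensure_tagged(X):
    """Hypothetical helper: return X as a list of TaggedDocument objects."""
    X = list(X)
    if X and isinstance(X[0], TaggedDocument):
        # Already tagged: pass through unchanged.
        return X
    # Otherwise treat each item as a token list and tag it with its position.
    return [TaggedDocument(list(words), [i]) for i, words in enumerate(X)]


train_input = [["calculus", "mathematical"], ["natural", "nuclear"]]
tagged = ensure_tagged(train_input)
```

With a conversion like this at the top of fit, the same list of token lists could be passed to both fit and transform, so fit_transform would receive a format that both steps accept.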
