D2VTransformer raises if passed a Pandas series with an index that does not contain the key 0:
import pandas as pd
from gensim.sklearn_api import D2VTransformer
from gensim.test.utils import common_texts
series = pd.Series(common_texts)
series.index += 1 # Increment the index so that it does not contain the key 0
transformer = D2VTransformer(min_count=1, size=5)
transformer.fit(series)
Output:
Traceback (most recent call last):
File "main.py", line 9, in <module>
transformer.fit(series)
File "venv/lib/python3.7/site-packages/gensim/sklearn_api/d2vmodel.py", line 162, in fit
if isinstance(X[0], doc2vec.TaggedDocument):
File "venv/lib/python3.7/site-packages/pandas/core/series.py", line 868, in __getitem__
result = self.index.get_value(self, key)
File "venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4375, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
This occurs because the fit and transform methods of D2VTransformer require __getitem__ on the passed iterable not to raise an exception for key 0.
Versions:
Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 09:23:15) [Clang 10.0.1 (clang-1001.0.46.3)]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1
TODO: check other sklearn_api models too – same issue there as in D2VTransformer?
So I need to check here for the first element, not element at index 0?
Will that be enough to close the issue?
@piskvorky ready for review and, hopefully, a merge.
@Hiyorimi just looking at this code again, I've realized we're making a copy of a list. This is unnecessary:
if isinstance([i for i in X[:1]][0], doc2vec.TaggedDocument):
d2v_sentences = X
d2v_sentences = X
You could do something like:
def _get_first(some_list):
for elem in some_list:
return elem
...
if isinstance(_get_first(X), doc2vec.TaggedDocument):
d2v_sentences = X
d2v_sentences = X
WDYT?
Are you sure that it is better rather than making a copy of a slice?
Not 100%. Making a copy of the list seems wasteful, though. Do you disagree?
It is 1 line + copy of 1 element
vs
3 lines of function
I have no clue what is better here.
The standard way to get the first element of a repeatable iterable is next(iter(x)).
For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change x itself.
What is the type of X here?
[i for i in X[:1]][0] is too opaque, I'm +1 on expressing the logic more clearly.
Since
>>> a = pd.DataFrame([[1,2], [2,4]], columns=['1', '2'])
>>> a.index += 1
>>> _get_first = lambda X: next(iter(X))
>>> _get_first(a)
'1'
I just added method.
Opened a PR
Most helpful comment
The standard way to get the first element of a repeatable iterable is
next(iter(x)).For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change
xitself.What is the type of
Xhere?[i for i in X[:1]][0]is too opaque, I'm +1 on expressing the logic more clearly.