Gensim: D2VTransformer raises if passed a Pandas series without index key 0

Created on 14 Jul 2019  Â·  10Comments  Â·  Source: RaRe-Technologies/gensim

D2VTransformer raises if passed a Pandas series with an index that does not contain the key 0:

import pandas as pd
from gensim.sklearn_api import D2VTransformer
from gensim.test.utils import common_texts

series = pd.Series(common_texts)
series.index += 1  # Increment the index so that it does not contain the key 0

transformer = D2VTransformer(min_count=1, size=5)
transformer.fit(series)

Output:

Traceback (most recent call last):
  File "main.py", line 9, in <module>
    transformer.fit(series)
  File "venv/lib/python3.7/site-packages/gensim/sklearn_api/d2vmodel.py", line 162, in fit
    if isinstance(X[0], doc2vec.TaggedDocument):
  File "venv/lib/python3.7/site-packages/pandas/core/series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4375, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

This occurs because the fit and transform methods of D2VTransformer require __getitem__ on the passed iterable not to raise an exception for key 0.

Versions:

Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 09:23:15) [Clang 10.0.1 (clang-1001.0.46.3)]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1

Hacktoberfest bug difficulty easy impact LOW reach MEDIUM

Most helpful comment

The standard way to get the first element of a repeatable iterable is next(iter(x)).

For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change x itself.

What is the type of X here?

[i for i in X[:1]][0] is too opaque, I'm +1 on expressing the logic more clearly.

All 10 comments

TODO: check other sklearn_api models too – same issue there as in D2VTransformer?

So I need to check here for the first element, not element at index 0?
Will that be enough to close the issue?

@piskvorky ready for review and, hopefully, a merge.

@Hiyorimi just looking at this code again, I've realized we're making a copy of a list. This is unnecessary:

        if isinstance([i for i in X[:1]][0], doc2vec.TaggedDocument):
            d2v_sentences = X
            d2v_sentences = X

You could do something like:

def _get_first(some_list):
    for elem in some_list:
        return elem

...

        if isinstance(_get_first(X), doc2vec.TaggedDocument):
            d2v_sentences = X
            d2v_sentences = X

WDYT?

Are you sure that it is better rather than making a copy of a slice?

Not 100%. Making a copy of the list seems wasteful, though. Do you disagree?

It is 1 line + copy of 1 element
vs
3 lines of function

I have no clue what is better here.

The standard way to get the first element of a repeatable iterable is next(iter(x)).

For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change x itself.

What is the type of X here?

[i for i in X[:1]][0] is too opaque, I'm +1 on expressing the logic more clearly.

Since

>>> a = pd.DataFrame([[1,2], [2,4]], columns=['1', '2'])
>>> a.index += 1
>>> _get_first = lambda X: next(iter(X))
>>> _get_first(a)
'1'

I just added method.

Opened a PR

Was this page helpful?
0 / 5 - 0 ratings