Gensim: D2VTransformer raises if passed a Pandas series without index key 0

Created on 14 Jul 2019 · 10Comments · Source: RaRe-Technologies/gensim

D2VTransformer raises if passed a Pandas series with an index that does not contain the key 0:

import pandas as pd
from gensim.sklearn_api import D2VTransformer
from gensim.test.utils import common_texts

series = pd.Series(common_texts)
series.index += 1  # Increment the index so that it does not contain the key 0

transformer = D2VTransformer(min_count=1, size=5)
transformer.fit(series)

Output:

Traceback (most recent call last):
  File "main.py", line 9, in <module>
    transformer.fit(series)
  File "venv/lib/python3.7/site-packages/gensim/sklearn_api/d2vmodel.py", line 162, in fit
    if isinstance(X[0], doc2vec.TaggedDocument):
  File "venv/lib/python3.7/site-packages/pandas/core/series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4375, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

This occurs because the fit and transform methods of D2VTransformer require __getitem__ on the passed iterable not to raise an exception for key 0.

Versions:

Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 09:23:15) [Clang 10.0.1 (clang-1001.0.46.3)]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1

Hacktoberfest bug difficulty easy impact LOW reach MEDIUM

Source

rpetchler

Most helpful comment

The standard way to get the first element of a repeatable iterable is next(iter(x)).

For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change x itself.

What is the type of X here?

[i for i in X[:1]][0] is too opaque, I'm +1 on expressing the logic more clearly.

piskvorky on 13 Oct 2019

👍2

All 10 comments

TODO: check other sklearn_api models too – same issue there as in D2VTransformer?

piskvorky on 8 Oct 2019

So I need to check here for the first element, not element at index 0?
Will that be enough to close the issue?

Hiyorimi on 9 Oct 2019

@piskvorky ready for review and, hopefully, a merge.

Hiyorimi on 10 Oct 2019

@Hiyorimi just looking at this code again, I've realized we're making a copy of a list. This is unnecessary:

        if isinstance([i for i in X[:1]][0], doc2vec.TaggedDocument):
            d2v_sentences = X
            d2v_sentences = X

You could do something like:

def _get_first(some_list):
    for elem in some_list:
        return elem

...

        if isinstance(_get_first(X), doc2vec.TaggedDocument):
            d2v_sentences = X
            d2v_sentences = X

WDYT?

mpenkov on 12 Oct 2019

Are you sure that it is better rather than making a copy of a slice?

Hiyorimi on 12 Oct 2019

Not 100%. Making a copy of the list seems wasteful, though. Do you disagree?

mpenkov on 13 Oct 2019

It is 1 line + copy of 1 element
vs
3 lines of function

I have no clue what is better here.

Hiyorimi on 13 Oct 2019

The standard way to get the first element of a repeatable iterable is next(iter(x)).

For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change x itself.

What is the type of X here?

[i for i in X[:1]][0] is too opaque, I'm +1 on expressing the logic more clearly.

piskvorky on 13 Oct 2019

👍2

Since

>>> a = pd.DataFrame([[1,2], [2,4]], columns=['1', '2'])
>>> a.index += 1
>>> _get_first = lambda X: next(iter(X))
>>> _get_first(a)
'1'

I just added method.

Hiyorimi on 14 Oct 2019

Opened a PR

Hiyorimi on 14 Oct 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Bug in Phrases.export_phrases()

jeradf · 4Comments

Word2Vec ns_exponent cannot be changed from default

coopwilliams · 3Comments

Bug in sklearn_api.hdp and in sklearn_api.ldamodel

mmunozm · 3Comments

gensim.similarities.SparseMatrixSimilarity get segmentation-fault

dancinghui · 4Comments

subtle errors in gradient descent update step in word2vec.py

Jianqiang · 3Comments