Gensim: Type mismatch in utils.py: use an iterator as an iterable

Created on 13 Jun 2020 · 6 Comments · Source: RaRe-Technologies/gensim

I noticed that in both line 1177 and line 1179 _it_ was passed as the first argument to itertools.islice().

Note that itertools.islice() expects its first argument to be an _iterable_, but _it_ is not necessarily an iterable. It is simply an iterator (see line 1172), so there is a type mismatch.
https://github.com/RaRe-Technologies/gensim/blob/8149035e22c3df932a22fc654ae35942d5e2f866/gensim/utils.py#L1145-L1183

I guess an easy fix to this bug would be to directly pass the provided argument, _iterable_. I didn't see any need to create an iterator for it.
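
One thing to keep in mind with that proposed fix: if the argument is a re-iterable container such as a list, every call to itertools.islice() on it starts again from the first element, so a chunking loop would never advance. A minimal sketch using only the standard library (the `docs` list is made up for illustration):

```python
import itertools

docs = ["a", "b", "c", "d", "e"]  # hypothetical stand-in for a corpus

# Slicing the container directly: each call builds a fresh iterator,
# so both calls return the same first chunk.
first = list(itertools.islice(docs, 2))
second = list(itertools.islice(docs, 2))

# Slicing one shared iterator: each call resumes where the previous one stopped.
it = iter(docs)
chunk1 = list(itertools.islice(it, 2))
chunk2 = list(itertools.islice(it, 2))

print(first, second)   # ['a', 'b'] ['a', 'b']
print(chunk1, chunk2)  # ['a', 'b'] ['c', 'd']
```

This is probably why the function obtains a single iterator up front and slices that, rather than slicing the original argument.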

I ran into this bug while using LdaModel. The initializer of that class breaks the corpus, which is an iterable, into chunks using chunkize_serial. I had implemented my own corpus to stream documents from disk, and I got a TypeError claiming that the corresponding iterator I implemented was not iterable.

Thanks for taking a look at this!

All 6 comments

Can you show the actual error you received, and possibly some minimal code to trigger that error?

Sure, and sorry for getting back to you a bit late! This is the code that triggers the error.

import gensim.utils as utils



class SimpleIterator:
    def __init__(self):
        self.pos = -1
        self.array = [i for i in range(10)]
    def __next__(self):
        self.pos += 1
        if self.pos == 10:
            raise StopIteration
        else:
            return self.array[self.pos]

class SimpleIterable:
    def __iter__(self):
        return SimpleIterator()


iterable = SimpleIterable()
print(list(utils.chunkize_serial(iterable = iterable, chunksize = 2)))

And the error is as follows:

TypeError                                 Traceback (most recent call last)
~/Documents/programs/ngramGen.py in 
      22 iterable = SimpleIterable()
----> 23 print(list(utils.chunkize_serial(iterable = iterable, chunksize = 2)))

~/Library/Python/3.7/lib/python/site-packages/gensim/utils.py in chunkize_serial(iterable, chunksize, as_numpy, dtype)
   1177             wrapped_chunk = [[np.array(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]
   1178         else:
-> 1179             wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]
   1180         if not wrapped_chunk[0]:
   1181             break

TypeError: 'SimpleIterator' object is not iterable
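
The same error can be reproduced without gensim. As a rough sketch, assuming CPython's behaviour: iter() accepts whatever __iter__() returns as long as it has __next__(), but itertools.islice() then calls iter() on that object again, which fails because SimpleIterator defines no __iter__(). The class names mirror the ones in the report above.

```python
import itertools

class SimpleIterator:
    """Has __next__() but, like the class in the report, no __iter__()."""
    def __init__(self):
        self.pos = -1
        self.array = list(range(10))

    def __next__(self):
        self.pos += 1
        if self.pos >= len(self.array):
            raise StopIteration
        return self.array[self.pos]

class SimpleIterable:
    def __iter__(self):
        return SimpleIterator()

it = iter(SimpleIterable())  # succeeds: the returned object has __next__()
first_item = next(it)        # calling next() directly works as well

try:
    itertools.islice(it, 2)  # islice() calls iter(it), which needs __iter__()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```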

Sorry, but is there any feedback on this?

Thanks for the clear example code!

Technically, all iterators are also supposed to be iterable. For example, see: https://docs.python.org/3/glossary.html#term-iterator - which explains (emphasis added):

iterator
An object representing a stream of data. Repeated calls to the iterator's __next__() method (or passing it to the built-in function next()) return successive items in the stream. When no more data are available a StopIteration exception is raised instead. At this point, the iterator object is exhausted and any further calls to its __next__() method just raise StopIteration again. _Iterators are required to have an __iter__() method that returns the iterator object itself so every iterator is also iterable and may be used in most places where other iterables are accepted._ One notable exception is code which attempts multiple iteration passes. A container object (such as a list) produces a fresh new iterator each time you pass it to the iter() function or use it in a for loop. Attempting this with an iterator will just return the same exhausted iterator object used in the previous iteration pass, making it appear like an empty container.

I suspect if your iterator were to implement __iter__(), the error would go away.
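
A minimal sketch of that suggestion: the same iterator with an __iter__() method that returns self. To keep the example runnable without gensim installed, `chunk_serial` below is a hypothetical stand-in that only imitates the iter()/islice() loop visible in the traceback; it is not gensim's actual chunkize_serial.

```python
import itertools

class SimpleIterator:
    def __init__(self):
        self.pos = -1
        self.array = list(range(10))

    def __iter__(self):
        # An iterator returns itself, which also makes it iterable.
        return self

    def __next__(self):
        self.pos += 1
        if self.pos >= len(self.array):
            raise StopIteration
        return self.array[self.pos]

class SimpleIterable:
    def __iter__(self):
        return SimpleIterator()

def chunk_serial(iterable, chunksize):
    """Hypothetical stand-in: yield lists of up to chunksize items."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, int(chunksize)))
        if not chunk:
            break
        yield chunk

chunks = list(chunk_serial(SimpleIterable(), chunksize=2))
print(chunks)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Writing __iter__() on the iterable as a generator function (using yield) would sidestep the issue too, since generator objects already provide both methods.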

Yeah, I get it! I will try adding __iter__() to see if it works!

Closing this issue for now. But if you have fresh results showing a problem actually exists, feel free to update here with those for more consideration.
