Gensim: aliasing chunkize to chunkize_serial warning, on Windows

Created on 12 Jun 2018 · 2 Comments · Source: RaRe-Technologies/gensim

On Windows, after a fresh installation, importing gensim for the first time in Python (Jupyter Notebook) produces:

UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

I've been unable to work out what trouble or missing functionality we should expect. The warning is rather opaque, yet I assume it is there to tell us something as API users. That is my excuse for opening an issue here: it would be good to have an official comment on how to interpret this warning.

Labels: Hacktoberfest, documentation, good first issue

Most helpful comment

Yes, that warning is rather opaque, sorry about that.

Some algorithms in Gensim (mostly the distributed/parallelized versions) call a function named chunkize, which splits an input stream of records into batches. It works in a streaming manner (lazy batch iteration). chunkize has an optional parameter, maxsize, that makes it actively prepare and buffer data batches in advance: not quite lazy, but also not eager (it buffers a limited, fixed number of batches ahead of the consumer).
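To make the "lazy batch iteration" idea concrete, here is a minimal sketch of serial chunking. It is not Gensim's code: the name chunkize_serial_sketch and its signature are made up for illustration, and it only shows the general technique of slicing a stream into batches on demand.

```python
from itertools import islice

def chunkize_serial_sketch(stream, chunksize):
    """Lazily yield lists of at most `chunksize` items from `stream`.

    Hypothetical helper for illustration; not Gensim's implementation.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    it = iter(stream)
    while True:
        # Pull only as many records as one batch needs.
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

# Works on any iterable, including generators that cannot be rewound.
batches = list(chunkize_serial_sketch(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Nothing is read from the stream until the consumer asks for the next batch, which is why this variant needs no extra process but also cannot hide slow input behind a buffer.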

This optional functionality is not available on Windows, because it relies on multiprocessing. So on Windows only chunkize_serial is available (no buffering, so it can be slower). It's aliased to chunkize for API compatibility reasons.

This is a rather technical point, related to performance on slow I/O input streams on Windows. You can probably ignore it.
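If you decide to ignore the warning, one option is to filter it out with Python's standard warnings module before gensim is imported. The sketch below uses a stand-in function (fake_gensim_import, a hypothetical name) that emits the same message, so it runs without gensim installed; the helper import_quietly is also just an illustration.

```python
import warnings

WARNING_TEXT = "detected Windows; aliasing chunkize to chunkize_serial"

def import_quietly(do_import):
    """Run `do_import()` with the aliasing warning filtered out.

    The filter is scoped to this call, so other warnings are unaffected.
    """
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=WARNING_TEXT, category=UserWarning)
        return do_import()

def fake_gensim_import():
    # Hypothetical stand-in for `import gensim`; emits the same UserWarning.
    warnings.warn(WARNING_TEXT)
    return "imported"

# Record whatever warnings still get through, to check the filter works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = import_quietly(fake_gensim_import)
```

In a notebook, the same effect should be achievable by calling warnings.filterwarnings with that message once, in a cell that runs before the first import of gensim.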

A PR that clarifies the warning will be very welcome! We should probably only emit that warning when that situation actually happens (chunkize called with maxsize > 0 on Windows), rather than always on import.
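The suggestion above could look roughly like the following sketch, which warns only when buffering is actually requested on Windows. This is an assumption about the shape of such a change, not the code that was merged: chunkize_sketch, _serial_chunks, the os_name parameter (added here so the behaviour can be exercised on any platform) and the wording of the warning are all hypothetical.

```python
import os
import warnings
from itertools import islice

def _serial_chunks(stream, chunksize):
    """Lazily yield lists of at most `chunksize` items (no buffering)."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

def chunkize_sketch(stream, chunksize, maxsize=0, os_name=os.name):
    """Return a batch iterator; warn only if buffering was requested on Windows."""
    if maxsize > 0 and os_name == "nt":
        # Emitted at call time, not at import time.
        warnings.warn(
            "detected Windows; buffering (maxsize > 0) is unavailable, "
            "falling back to serial chunking"
        )
    # Simplification: this sketch always chunks serially. A real
    # implementation would start a background worker for maxsize > 0
    # on platforms where that is supported.
    return _serial_chunks(stream, chunksize)
```

With this shape, users who never ask for buffering never see the warning, and those who do get a message that names the parameter responsible.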

All 2 comments

Fixed in #2202.
