On Windows, after a fresh installation, importing gensim for the first time in Python (Jupyter Notebook) produces:
UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
I've been unable to work out what problems or missing functionality we should expect. The warning is rather opaque, yet I assume it is in the code to tell us, as API users, something. That is my excuse for opening an issue here: I'd like an official comment on how to read this warning.
Yes, that warning is rather opaque, sorry about that.
Some algorithms in Gensim (mostly the distributed/parallelized versions) call a function called chunkize, which splits an input stream of records into batches. It works in a streaming manner (lazy batch iteration). chunkize has an optional parameter that can actively prepare and buffer data batches in advance: not quite lazy, but also not eager (buffers a limited fixed number of batches in advance).
This optional buffering is not available on Windows, because it relies on multiprocessing, which is slower there. So on Windows, only chunkize_serial is available (no buffering). It's aliased to chunkize for API compatibility reasons.
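To make "lazy batch iteration without buffering" concrete, here is a minimal sketch of serial batching. It is not Gensim's source: the name serial_batches and its parameters are made up for illustration, and it only shows the general idea of pulling one batch at a time from a stream.

```python
from itertools import islice


def serial_batches(records, chunksize):
    """Yield lists of up to `chunksize` records, one batch at a time.

    Hypothetical helper for illustration only (not part of Gensim).
    Nothing is read ahead: the next batch is pulled from `records`
    only when the consumer asks for it.
    """
    it = iter(records)
    while True:
        batch = list(islice(it, chunksize))
        if not batch:
            return  # input stream exhausted
        yield batch


if __name__ == "__main__":
    for batch in serial_batches(range(7), 3):
        print(batch)
```

A buffered variant would prepare a few batches ahead of the consumer in the background, which helps when reading the input is slow. That is the part that is skipped on Windows.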
This is a rather technical point, related to performance on slow I/O input streams on Windows. You can probably ignore it.
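If the message is just noise in a notebook, one option is to filter it with the standard warnings module before importing gensim. This is a sketch that assumes the warning text is exactly the one quoted at the top of this issue. The helper name silence_chunkize_warning is made up for illustration.

```python
import warnings


def silence_chunkize_warning():
    """Ignore only the 'aliasing chunkize' UserWarning (hypothetical helper).

    The `message` argument is a regular expression matched against the
    start of the warning text, so other UserWarnings are still shown.
    """
    warnings.filterwarnings(
        "ignore",
        message="detected Windows; aliasing chunkize to chunkize_serial",
        category=UserWarning,
    )


silence_chunkize_warning()
# import gensim  # do the import after the filter is installed
```

Filtering on the message text rather than on all UserWarnings keeps unrelated warnings visible.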
A PR that clarifies the warning would be very welcome! We should probably emit that warning only when the situation actually occurs (chunkize called with maxsize > 0 on Windows), rather than always on import.
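A rough sketch of that suggestion, to show the shape of the check. This is an assumption about how it could look, not the code from any actual PR: chunkize_sketch, its on_windows parameter, and the warning text are all hypothetical, and the buffered path is left out entirely.

```python
import os
import warnings
from itertools import islice


def _serial_batches(records, chunksize):
    """Lazily yield lists of up to `chunksize` records (no buffering)."""
    it = iter(records)
    while True:
        batch = list(islice(it, chunksize))
        if not batch:
            return
        yield batch


def chunkize_sketch(records, chunksize, maxsize=0, on_windows=(os.name == "nt")):
    """Hypothetical wrapper: warn only when buffering was requested on Windows.

    `on_windows` is a parameter only so the behaviour can be tried out on
    any platform. The buffered path (maxsize > 0 elsewhere) is omitted in
    this sketch, so every call falls back to serial batching.
    """
    if maxsize > 0 and on_windows:
        warnings.warn(
            "buffered chunkize (maxsize > 0) is unavailable on Windows; "
            "using serial batching instead"
        )
    return _serial_batches(records, chunksize)
```

With this shape, a plain import stays silent, and only callers who explicitly ask for buffering on Windows see the message.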
Fixed in #2202.