Gensim: WikiCorpus: Support random sample

Created on 21 Mar 2015  Â·  7Comments  Â·  Source: RaRe-Technologies/gensim

Related to #307.

random.sample(wiki, n) will get killed on a reasonably spec'd machine, regardless of the size of n, it'd be nice to get a decently sized random sample for creating models quickly.

Most helpful comment

random.sample(population, n) is (very) silently expecting a list. It calls len(population) (not a problem) and population[i] (this is a problem) so it does not work on streams. To be honest, the implementation of random.sample is quiet bad. I will implement the algorithm that does work on streams.
There are two options.

  • Reservoir sampling: have an array of size n and for each element of stream decide, if it should be there or not.
  • Online sampling: if n is large, and we know size of the generator, we do not need to have all samples in memory, but for each element we can decide, if we want to choose it or not. This seems like a better solution.

All 7 comments

I agree, sounds like useful functionality.

There actually exist algorithms that take provably random samples even from a stream (non-repeatable generator).

I actually thought Python's random.sample implements such an algo -- why do you think it wouldn't work? Did you try?

random.sample(population, n) is (very) silently expecting a list. It calls len(population) (not a problem) and population[i] (this is a problem) so it does not work on streams. To be honest, the implementation of random.sample is quiet bad. I will implement the algorithm that does work on streams.
There are two options.

  • Reservoir sampling: have an array of size n and for each element of stream decide, if it should be there or not.
  • Online sampling: if n is large, and we know size of the generator, we do not need to have all samples in memory, but for each element we can decide, if we want to choose it or not. This seems like a better solution.

Resolved in #1408

I see I'm late but a couple thoughts:

Finding the length can require an initial extra full-scan of the texts – allowing a knowledgeable caller to optionally pre-specify a length could skip this potentially costly operation.

Using the global random means any one sampling isn't easily reproducible from run-to-run – adding an optional random seed parameter would allow samples to be reproduced.

I believe that global random actually means better reproducibility! If you manually set the random.seed() in your script, it will set the seed for the library as well. I have tested it only on a toy example so maybe I am horribly wrong, but it should work.

Regarding the length. We should at least mention it in documentation. There are two options. Ask user, to manually set the self.length variable or to add a parameter first_x to the random sample function that would be interpreted as stream length (but not saved to self.length). Then the random sample would be only from first_x elements. However I am not sure what should happend if the first_x would be greater than the real length of stream.
I am not sure how strict gensim wants to be about this type of performance.

If people try pre-seeding the global random to try to achieve reproducibility in this function, it will have side-effects on other consumers of random numbers, and potentially suffer interference from any global-random-consumers in other threads, or in the (interleaved) processing of each next()-offered text. Using a local random instance, explicitly seeded (if a seed provided) ensures no side-effects or interleaved interference.

I don't believe setting a length property on the object will be enough to have built-in len() use that value – that depends on the existence of a __len__() special-method. An optionally-passed in length-hint would definitely solve the performance problem, and you are right to notice that if it were less than the actual length, it'd be essentially a "from the first x" limiter, and if it were longer than the actual length, might result in fewer than n yielded results.

Ou I see. To be honest this will the first time I will use random.Radnom :) . I will add an optional paramer. If set, it will initialize it's own random stream with that seed, otherwise it will use global random.

Setting length should be enough, because __len__ is overridden at TextCorpus to use self.length if defined or to compute it. However another optional parameter should not hurt. @menshikh-iv @piskvorky any ideas?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

vlad17 picture vlad17  Â·  4Comments

menshikh-iv picture menshikh-iv  Â·  4Comments

menshikh-iv picture menshikh-iv  Â·  3Comments

Jianqiang picture Jianqiang  Â·  3Comments

Laubeee picture Laubeee  Â·  3Comments