I want to train a Word2Vec model where I can specify a custom negative-sampling subset for each input word.
This could be implemented by adding an argument that takes a pre-defined mapping from each vocabulary word to its negative-sampling subset. Another option would be to accept a user-supplied function that returns the subset to sample from.
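To illustrate the idea, here is a minimal sketch of what such a per-word restriction could look like. The names `negative_subsets` and `sample_negatives` are hypothetical, not part of gensim's API; this only shows the sampling step, not the full training loop.

```python
import random

def sample_negatives(word, k, vocab, negative_subsets=None, rng=random):
    """Draw k negative samples for `word`.

    If `negative_subsets` maps `word` to a candidate list, draw only from
    that list; otherwise fall back to sampling over the rest of the vocab.
    (Hypothetical sketch of the proposed feature, not gensim code.)
    """
    if negative_subsets and word in negative_subsets:
        candidates = negative_subsets[word]
    else:
        candidates = [w for w in vocab if w != word]
    return [rng.choice(candidates) for _ in range(k)]

vocab = ["king", "queen", "man", "woman", "apple"]
# Restrict the negatives drawn for "king" to a chosen subset:
subsets = {"king": ["apple", "man"]}
negatives = sample_negatives("king", 3, vocab, negative_subsets=subsets)
```

A function-based variant would simply replace the dict lookup with a call to the user's callback.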
I know this would make the computation more complex, but it is a much-needed feature for anyone trying to regulate the features learned by the model.
Let me know if more explanation is needed.
Thank you! 😄
Could you please be a little clearer? Are you proposing to implement a new feature yourself? Or are you requesting a new feature for someone else to implement?
I am asking it to be implemented by someone.
If there were some published research showing that this improved results for certain problems, then it might get attention from someone who wanted an implementation challenge, or needed it for their own work. Or, if you were submitting working code, along with examples of where it helps, it might get consideration for integration (as long as it didn't harm/slow other uses).
But without experimental results it's not clear this idiosyncratic variant of word2vec/doc2vec/fasttext/etc would offer any benefits, so it's very unlikely someone who didn't already have curiosity/belief in the technique would implement it. (For example, I don't quite understand how this is a "much-needed feature if someone is trying to regulate the features learned by the model". Normal negative sampling seems to work fine, and I'd want to see concrete examples of where this approach is thought to do better.)
@piskvorky I'm closing this because it's very narrow and unlikely to get any attention in the future. Please let me know if you disagree.
@mpenkov 👍 I like the abstraction, but the performance cost would be terrible. Not enough demand to warrant it IMO (and maintaining a separate code-path is a no-no).
I found some research papers where custom negative sampling is useful.
E.g.: https://dl.acm.org/doi/10.1145/3219819.3219885
It would be great if gensim could support a custom negative-sampling method.
@piskvorky