I want to train a Word2Vec model where I can specify a custom negative-sampling subset for each input word.
This could be implemented by adding an argument that takes a pre-defined mapping from each vocabulary word to its negative-sampling subset. Another option would be to accept a user-supplied function that returns the subset to sample from.
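To illustrate the idea, here is a minimal sketch of what such a per-word restriction could look like. The names `negative_subsets` and `sample_negatives` are hypothetical, not part of gensim's API; this only shows the sampling step, not the full training loop.

```python
import random

def sample_negatives(word, k, vocab, negative_subsets=None, rng=random):
    """Draw k negative samples for `word`.

    If `negative_subsets` maps `word` to a candidate list, draw only from
    that list; otherwise fall back to sampling over the rest of the vocab.
    (Hypothetical sketch of the proposed feature, not gensim code.)
    """
    if negative_subsets and word in negative_subsets:
        candidates = negative_subsets[word]
    else:
        candidates = [w for w in vocab if w != word]
    return [rng.choice(candidates) for _ in range(k)]

vocab = ["king", "queen", "man", "woman", "apple"]
# Restrict the negatives drawn for "king" to a chosen subset:
subsets = {"king": ["apple", "man"]}
negatives = sample_negatives("king", 3, vocab, negative_subsets=subsets)
```

A function-based variant would simply replace the dict lookup with a call to the user's callback.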
I know this would make the computation more complex, but it is a much-needed feature for anyone trying to regulate the features learned by the model.
Let me know if more explanation is needed.
Thank you! 😄
Could you please be a little clearer? Are you proposing to implement a new feature yourself? Or are you requesting a new feature for someone else to implement?
I am asking it to be implemented by someone.
If there were some published research showing that this improved results for certain problems, then it might get attention from someone who wanted an implementation challenge, or needed it for their own work. Or, if you were submitting working code, along with examples of where it helps, it might get consideration for integration (as long as it didn't harm/slow other uses).
But without experimental results it's not clear this idiosyncratic variant of word2vec/doc2vec/fasttext/etc would offer any benefits, so it's very unlikely someone who didn't already have curiosity/belief in the technique would implement it. (For example, I don't quite understand how this is a "much-needed feature if someone is trying to regulate the features learned by the model". Normal negative sampling seems to work fine, and I'd want to see concrete examples of where this approach is thought to do better.)
@piskvorky I'm closing this because it's very narrow and unlikely to get any attention in the future. Please let me know if you disagree.
@mpenkov 👍 I like the abstraction, but the performance cost would be terrible. Not enough demand to warrant it IMO (and maintaining a separate code-path is a no-no).
I found some research papers where custom negative sampling is useful.
E.g.: https://dl.acm.org/doi/10.1145/3219819.3219885
It would be great if gensim could support a custom negative-sampling method.
@piskvorky