Describe the feature and the current behavior/state.
FAIR have a cool paper where they introduce Product-Key Memory Layers - these are layers that can add a huge number of parameters (100M-1B) to a network with a very minimal compute overhead.
Unfortunately, implementing them efficiently depends on the EmbeddingBag layer from PyTorch. This layer basically does a gather op followed by a weighted sum across the final dimension of the gather indices.
It is trivial to implement this op as a composition of two or three ops in TensorFlow, but doing so requires you to materialize the output of the gather, which in the case of Product-Key Memory layers is enormous, and usually blows out my GPU RAM. By combining these ops into a single efficient call, EmbeddingBag avoids ever materializing the extremely large pre-sum gather output. There's no efficient way to do the same in TensorFlow without a custom op.
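To make the memory problem concrete, here is a minimal NumPy sketch of the two approaches. NumPy stands in for TensorFlow, and the function names `gather_sum_naive` and `gather_sum_streaming` are hypothetical. The streaming loop only illustrates the idea of never holding the full gather output; it is not the custom op's kernel.

```python
import numpy as np

def gather_sum_naive(values, indices, weights):
    """Gather, then weighted sum. Materializes a (batch, bag, dim) intermediate."""
    gathered = values[indices]  # (batch, bag, dim): the large pre-sum tensor
    return (gathered * weights[..., None]).sum(axis=1)

def gather_sum_streaming(values, indices, weights):
    """Same result, but only one (batch, dim) slice is alive at a time."""
    batch, bag = indices.shape
    out = np.zeros((batch, values.shape[1]), dtype=values.dtype)
    for k in range(bag):
        # Accumulate the k-th bag entry of every row without storing the others.
        out += weights[:, k, None] * values[indices[:, k]]
    return out

# Small illustrative sizes (assumed, not taken from the paper).
rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 16))            # (num_rows, dim)
indices = rng.integers(0, 1000, size=(8, 32))   # (batch, bag)
weights = rng.normal(size=(8, 32))              # (batch, bag)

naive = gather_sum_naive(values, indices, weights)
streamed = gather_sum_streaming(values, indices, weights)
print(np.allclose(naive, streamed))
```

The naive version's peak memory grows with batch × bag × dim, while the streaming version's grows with batch × dim, which is the saving a fused op is after.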
I've already gotten a CUDA and (single-threaded) CPU implementation of EmbeddingBag working locally using the custom-op repo and associated docker image. I've verified correctness by comparing outputs and gradients to those from the manual composition of ops, and speed and memory usage are vastly improved. I could also contribute a TF implementation of the Product-Key Memory layer itself if desired.
Relevant information
Which API type would this fall under (layer, metric, optimizer, etc.)
Layer
Who will benefit with this feature?
People who want to squeeze loads of parameters into their model while maintaining fast throughput and aren't worried about overfitting. The paper used it for big autoregressive NLP Transformers, but I suspect you could deploy it in a lot of other places too.
Any other info.
I have only implemented the portions of EmbeddingBag necessary for Product-Key Memory layers.
Have you checked whether the subgraph could already be fused? See https://www.tensorflow.org/lite/convert/operation_fusion?hl=en#wrap_the_composite_operation_in_a_tffunction
It could be nice to have fusion in TF and the layer here.
/cc @tanzhenyu @dynamicwebpaige for ecosystem pre-check
I don't believe subgraph fusion is possible - I tried using XLA and it didn't resolve the memory issues. I haven't tried TFLite but I would be surprised if this op could be fused automatically, as I had to implement several tricks. In particular, the gradient for the values tensor that is gathered from cannot be an IndexedSlices object, because EmbeddingBag (especially as used in PKM layers) usually gathers many more slices from the values tensor than a normal call to tf.gather(), and so the size of a naive IndexedSlices gradient could be several times larger than the values tensor itself!
Computing the dense values gradient efficiently requires some temp memory and a call to thrust::sort_by_key, plus some custom logic to ensure that CUDA can distribute work efficiently without multiple threads writing to the same entry in the values gradient (this is similar to the PyTorch implementation). I do not think any automatic operator fusion would be able to do this correctly.
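As a rough illustration of the sort-then-reduce idea, here is a NumPy sketch. It is not the CUDA kernel: `np.argsort` stands in for thrust::sort_by_key, the Python loop over segments stands in for the per-row work distribution, and the function name `dense_values_grad` and all sizes are hypothetical. After sorting, all contributions to one row of the values tensor are contiguous, so each row of the gradient is written exactly once.

```python
import numpy as np

def dense_values_grad(indices, weights, grad_out, num_rows):
    """Dense gradient w.r.t. values for out[b] = sum_k weights[b, k] * values[indices[b, k]]."""
    batch, bag = indices.shape
    grad = np.zeros((num_rows, grad_out.shape[1]), dtype=grad_out.dtype)
    flat_rows = indices.ravel()
    if flat_rows.size == 0:
        return grad
    # Sort the gathered row ids so duplicates become contiguous segments.
    order = np.argsort(flat_rows, kind="stable")
    sorted_rows = flat_rows[order]
    sorted_w = weights.ravel()[order]
    sorted_batch = order // bag  # which output row each sorted entry came from
    # Segment boundaries: positions where the row id changes.
    starts = np.flatnonzero(np.r_[True, sorted_rows[1:] != sorted_rows[:-1]])
    ends = np.r_[starts[1:], sorted_rows.size]
    for s, e in zip(starts, ends):
        # One writer per row: reduce the whole segment, then store once.
        grad[sorted_rows[s]] = sorted_w[s:e] @ grad_out[sorted_batch[s:e]]
    return grad

# Illustrative sizes: 64 * 8 = 512 gathered slices from a table of only 50 rows,
# so a per-slice (IndexedSlices-style) gradient would hold 512 slices for 50 rows.
rng = np.random.default_rng(0)
num_rows, dim = 50, 4
indices = rng.integers(0, num_rows, size=(64, 8))
weights = rng.normal(size=(64, 8))
grad_out = rng.normal(size=(64, dim))

grad = dense_values_grad(indices, weights, grad_out, num_rows)

# Reference: scatter-add of every per-slice contribution.
reference = np.zeros((num_rows, dim))
per_slice = (weights[..., None] * grad_out[:, None, :]).reshape(-1, dim)
np.add.at(reference, indices.ravel(), per_slice)
print(np.allclose(grad, reference))
```

The reference path builds the full per-slice tensor before scattering, which is exactly the intermediate the sorted approach avoids.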
Also, I commented in that issue that you linked - after further testing my solution without a custom op turned out to still have huge memory usage compared to the custom op solution, and much worse performance too.
> I haven't tried TFLite but I would be surprised if this op could be fused automatically
The link wasn't strictly about TFLite; I used it only as a documentation pointer to the composite-op fusion topic.
> Also, I commented in that issue that you linked - after further testing my solution without a custom op turned out to still have huge memory usage compared to the custom op solution, and much worse performance too.
This was just to notify other maintainers about the full history.
No problem! Also, one question - although I'm suggesting this as a PR for tf/addons, I'd ideally like to get it into TF itself, since it's already an op in PyTorch and there are a few Transformer derivatives that are using it, which as a result can't be fully implemented in TF.
Is going via tf/addons the right way to do this?
Yes. There isn't any specific protocol on which repository to open a feature request in, but in Addons, when we receive a feature contribution proposal, we tag the issue as ecosystem-review to check whether TF core, keras-cv, keras-nlp, model garden or any other ecosystem repo is already working internally on the same feature, or would be interested in having the PR in their repo.
Cool! Also, if we do include it as a PR to either tf core or addons, I propose naming it anything except "EmbeddingBag". "gather_sum" or "gather_reduce_sum" are much clearer about what it actually is.
Just gonna bump this to make sure it doesn't get lost
/cc Gentle ping to @tanzhenyu @dynamicwebpaige for the ecosystem pre-check
@tomerk Can you help us to route this Ecosystem review?
Checked in w/ @ematejska. The internal TF API owners & oss outreach will start a regular (bi-weekly?) review of PRs marked w/ the ecosystem-review label.