Addons: EmbeddingBag and Product-Key Memory Layers

Created on 14 Oct 2020 · 12 Comments · Source: tensorflow/addons

Describe the feature and the current behavior/state.
FAIR have a cool paper where they introduce Product-Key Memory Layers - these are layers that can add a huge number of parameters (100M-1B) to a network with a very minimal compute overhead.

Unfortunately, implementing them efficiently depends on the EmbeddingBag layer from PyTorch. This layer basically does a gather op followed by a weighted sum across the final dimension of the gather indices.

It is trivial to implement this op as a composition of two or three ops in TensorFlow, but doing so requires you to materialize the output of the gather, which in the case of Product-Key Memory layers is enormous, and usually blows out my GPU RAM. By combining these ops into a single efficient call, EmbeddingBag avoids ever materializing the extremely large pre-sum gather output. There's no efficient way to do the same in TensorFlow without a custom op.
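For reference, the naive composition described above might look like the following minimal sketch. The function name `naive_embedding_bag` and the toy tensors are hypothetical and are not taken from the proposed op. The sketch only shows where the large intermediate appears.

```python
import tensorflow as tf


def naive_embedding_bag(values, indices, weights):
    """Gather rows of `values`, then take a weighted sum over each bag.

    values:  (num_rows, dim) float tensor
    indices: (batch, bag_size) int tensor of row ids into `values`
    weights: (batch, bag_size) float tensor, one weight per gathered row
    returns: (batch, dim) float tensor
    """
    # This intermediate has shape (batch, bag_size, dim). It is the large
    # pre-sum tensor that a fused op would avoid materializing.
    gathered = tf.gather(values, indices)
    # Weight each gathered row, then sum over the bag dimension.
    return tf.reduce_sum(gathered * weights[..., tf.newaxis], axis=-2)


# Toy inputs (illustrative only).
values = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
indices = tf.constant([[0, 2], [1, 1]])
weights = tf.constant([[1.0, 0.5], [2.0, 0.0]])
out = naive_embedding_bag(values, indices, weights)
```

The intermediate holds batch × bag_size × dim elements, so it grows with the number of rows gathered per example, while the final output only holds batch × dim.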

I've already gotten a CUDA and (single-threaded) CPU implementation of EmbeddingBag working locally using the custom-op repo and associated docker image. I've verified correctness by comparing outputs and gradients to those from the manual composition of ops, and speed and memory usage are vastly improved. I could also contribute a TF implementation of the Product-Key Memory layer itself if desired.

Relevant information

Are you willing to contribute it (yes/no): yes
Are you willing to maintain it going forward? (yes/no): yes
Is there a relevant academic paper? (if so, where): https://arxiv.org/abs/1907.05242
Is there already an implementation in another framework? (if so, where): Yes, EmbeddingBag is already a PyTorch layer
Was it part of tf.contrib? (if so, where):

Which API type would this fall under (layer, metric, optimizer, etc.)
Layer

Who will benefit with this feature?
People who want to squeeze loads of parameters into their model while maintaining fast throughput and aren't worried about overfitting. The paper used it for big autoregressive NLP Transformers, but I suspect you could deploy it in a lot of other places too.

Any other info.
I have only implemented the portions of EmbeddingBag necessary for Product-Key Memory layers.

Feature Request ecosystem-review layers

Source

Rocketknight1

All 12 comments

Have you checked whether the subgraph can already be fused? See https://www.tensorflow.org/lite/convert/operation_fusion?hl=en#wrap_the_composite_operation_in_a_tffunction

It could be nice to have fusion in TF and the layer here.

bhack on 14 Oct 2020

Check also https://github.com/tensorflow/tensorflow/issues/32675

bhack on 14 Oct 2020

/cc @tanzhenyu @dynamicwebpaige for ecosystem pre-check

bhack on 14 Oct 2020

I don't believe subgraph fusion is possible - I tried using XLA and it didn't resolve the memory issues. I haven't tried TFLite but I would be surprised if this op could be fused automatically, as I had to implement several tricks. In particular, the gradient for the values tensor that is gathered from cannot be an IndexedSlices object, because EmbeddingBag (especially as used in PKM layers) usually gathers many more slices from the values tensor than a normal call to tf.gather(), and so the size of a naive IndexedSlices gradient could be several times larger than the values tensor itself!

Computing the dense values gradient efficiently requires some temp memory and a call to thrust::sort_by_key, plus some custom logic to ensure that CUDA can distribute work efficiently without multiple threads writing to the same entry in the values gradient (this is similar to the PyTorch implementation). I do not think any automatic operator fusion would be able to do this correctly.
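The sort-then-reduce idea can be illustrated with a small NumPy sketch. This is not the CUDA kernel from this issue or the PyTorch implementation. The function name `dense_values_grad` is hypothetical, `np.argsort` stands in for thrust::sort_by_key, and the per-lookup contributions are materialized here for clarity, which a fused kernel would avoid. It assumes at least one lookup.

```python
import numpy as np


def dense_values_grad(indices, weights, grad_out, num_rows):
    """Dense gradient w.r.t. `values` for a gather + weighted-sum op.

    indices:  (batch, bag_size) int array of row ids into `values`
    weights:  (batch, bag_size) float array
    grad_out: (batch, dim) upstream gradient of the op's output
    returns:  (num_rows, dim) dense gradient
    """
    batch, bag_size = indices.shape
    flat_idx = indices.reshape(-1)
    flat_w = weights.reshape(-1)
    # Which output row each lookup contributed to.
    src = np.repeat(np.arange(batch), bag_size)

    # Sort lookups by destination row so that all contributions to one
    # row of the gradient are contiguous (stand-in for sort_by_key).
    order = np.argsort(flat_idx, kind="stable")
    sorted_idx = flat_idx[order]
    contrib = flat_w[order, None] * grad_out[src[order]]

    # Start offset of each run of equal row ids.
    is_start = np.r_[True, sorted_idx[1:] != sorted_idx[:-1]]
    starts = np.flatnonzero(is_start)

    # Each run is reduced once and written to exactly one gradient row,
    # so no two writers ever target the same entry.
    grad = np.zeros((num_rows, grad_out.shape[1]), dtype=grad_out.dtype)
    grad[sorted_idx[starts]] = np.add.reduceat(contrib, starts, axis=0)
    return grad


# Toy inputs (illustrative only).
indices = np.array([[0, 2], [1, 1]])
weights = np.array([[1.0, 0.5], [2.0, 0.0]])
grad_out = np.array([[1.0, 1.0], [10.0, 10.0]])
grad_values = dense_values_grad(indices, weights, grad_out, num_rows=4)
```

The result has the same shape as the values tensor regardless of how many lookups were made, which is the property a naive IndexedSlices gradient lacks.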

Also, I commented in that issue that you linked - after further testing my solution without a custom op turned out to still have huge memory usage compared to the custom op solution, and much worse performance too.

Rocketknight1 on 14 Oct 2020

> I haven't tried TFLite but I would be surprised if this op could be fused automatically

The link wasn't strictly about TFLite; I used it only as a documentation pointer to the topic of composite-op fusion.

> Also, I commented in that issue that you linked - after further testing my solution without a custom op turned out to still have huge memory usage compared to the custom op solution, and much worse performance too.

This was just to notify other maintainers about the full history.

bhack on 14 Oct 2020

No problem! Also, one question - although I'm suggesting this as a PR for tf/addons, I'd ideally like to get it into TF itself, since it's already an op in PyTorch and there are a few Transformer derivatives that are using it, which as a result can't be fully implemented in TF.

Is going via tf/addons the right way to do this?

Rocketknight1 on 15 Oct 2020

Yes. There isn't any specific protocol for which repository a feature request should start in, but in Addons, when we receive a feature contribution proposal, we tag the issue as ecosystem-review to check whether TF core, keras-cv, keras-nlp, model garden, or any other ecosystem repo is already working internally on the same feature or would be interested in having the PR in their repo.

bhack on 15 Oct 2020

Cool! Also, if we do include it as a PR to either tf core or addons, I propose naming it anything except "EmbeddingBag". "gather_sum" or "gather_reduce_sum" are much clearer about what it actually is.

Rocketknight1 on 18 Oct 2020

Just gonna bump this to make sure it doesn't get lost

Rocketknight1 on 27 Oct 2020

/cc Gentle ping for @tanzhenyu @dynamicwebpaige for the ecosystem pre-check

bhack on 27 Oct 2020

@tomerk Can you help us route this ecosystem review?

bhack on 20 Nov 2020

Checked in w/ @ematejska. The internal TF API owners & oss outreach will start a regular (bi-weekly?) review of PRs marked w/ the ecosystem-review label.

tomerk on 20 Nov 2020

👍3
