Addons: Make lookahead, weight decay, and stochastic weight averaging incompatible.

Created on 13 Jul 2020 · 7 Comments · Source: tensorflow/addons

I was playing around with different architectures and training techniques and I noticed some odd behavior.

It seems that when we create an SGDW optimizer and then wrap it with either Lookahead or SWA, no error is thrown; the weight_decay is simply truncated from the optimizer.

If you try to wrap an SWA optimizer with Lookahead or vice versa, you'll get an explicit error.

In retrospect, it makes sense that these 3 optimizers are incompatible, as they all operate on weights and combining them doesn't make sense.

However, I believe it might be a good idea to explicitly throw an error that lets us know the weight decay optimizer is not compatible with the Lookahead or SWA wrapper, instead of just silently removing the weight_decay attribute.
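A minimal sketch of the kind of explicit check requested here, written against stand-in classes rather than the real tensorflow_addons optimizers. The names PlainOptimizer, DecayOptimizer, and check_wrappable are hypothetical and exist only for illustration; nothing below is part of the library.

```python
# Hypothetical stand-ins; these are NOT the tensorflow_addons classes.
class PlainOptimizer:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate


class DecayOptimizer(PlainOptimizer):
    def __init__(self, learning_rate, weight_decay):
        super().__init__(learning_rate)
        self.weight_decay = weight_decay


def check_wrappable(optimizer):
    """Raise instead of silently hiding weight_decay behind a wrapper."""
    if hasattr(optimizer, "weight_decay"):
        raise ValueError(
            f"{type(optimizer).__name__} defines weight_decay, which the "
            "wrapper would not expose; wrap a plain optimizer instead."
        )
    return optimizer


check_wrappable(PlainOptimizer(learning_rate=1e-1))  # passes through

try:
    check_wrappable(DecayOptimizer(learning_rate=1e-1, weight_decay=1e-4))
except ValueError as err:
    print(f"rejected: {err}")
```

A warning instead of an exception would be a softer variant of the same idea; which one fits better is a design choice for the maintainers.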

Labels: help wanted, optimizers

ben-arnao

All 7 comments

Hi @ben-arnao, you are more than welcome to open a PR for better documentation and error handling! Also, can you provide a minimal runnable code snippet that reproduces what you describe? Thank you for the input.

WindQAQ on 14 Jul 2020

@WindQAQ I think this should work.

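from tensorflow_addons.optimizers import Lookahead, SGDW

# Build an SGD optimizer with decoupled weight decay.
opt = SGDW(learning_rate=1e-1,
           weight_decay=1e-4)
# Check for the attribute on the bare optimizer.
print(hasattr(opt, 'weight_decay'))
# Wrap it with Lookahead and check again on the wrapper.
opt = Lookahead(opt)
print(hasattr(opt, 'weight_decay'))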

ben-arnao on 14 Jul 2020

👍1

Sorry that I rushed to review this yesterday. What do you mean when you say the weight_decay is simply truncated from the optimizer?

WindQAQ on 15 Jul 2020

Lookahead (and SWA) will silently remove the weight_decay attribute from the optimizer, as the snippet above shows. This makes sense, I guess, because they all operate on weights and we can't do weight decay while also doing Lookahead/SWA.

The issue I had was that no warning or error is thrown if you try to wrap an optimizer that has weight decay (SGDW, for example) with Lookahead/SWA. It's not necessarily a functionality issue, but it would be nice if TF gave some feedback.

Also, I'm not sure if there is something deeper in the structure of these optimizers that I'm not aware of, but if we extend an optimizer with Lookahead and then try to extend it with SWA, we'll get an explicit error saying that SWA can't extend an optimizer of type Lookahead, or something of that nature. However, if we extend SGD with the decoupled weight decay extension to get SGDW, the base class is still SGD, I guess, so we don't get the same error.
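One plausible reading of that last point, sketched with stand-in classes: a wrapper that rejects other wrappers by type cannot see an extension that was mixed into a regular optimizer. Every class below is hypothetical, and the type check is an assumption for illustration, not the confirmed tensorflow_addons logic.

```python
# Hypothetical stand-ins; the real class hierarchy may differ.
class Optimizer:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate


class SGD(Optimizer):
    pass


class DecayExtension:
    """Mixin that adds weight_decay to whatever optimizer follows it."""

    def __init__(self, weight_decay, **kwargs):
        super().__init__(**kwargs)
        self.weight_decay = weight_decay


class SGDW(DecayExtension, SGD):
    pass


class Wrapper(Optimizer):
    """Assumed behavior: refuse to wrap another wrapper, by type."""

    def __init__(self, optimizer):
        super().__init__()
        if isinstance(optimizer, Wrapper):
            raise TypeError("cannot wrap an optimizer that is already wrapped")
        self._optimizer = optimizer


sgdw = SGDW(weight_decay=1e-4)
print(isinstance(sgdw, SGD))      # True: the extension keeps SGD as a base
wrapped = Wrapper(sgdw)           # accepted, the type check sees nothing odd

try:
    Wrapper(wrapped)              # a wrapper around a wrapper is rejected
except TypeError as err:
    print(f"rejected: {err}")
```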

ben-arnao on 15 Jul 2020

Oh, sorry for misunderstanding you. For an optimizer wrapper like Lookahead, it is not truncated; you can still access it like this:

opt._optimizer.weight_decay

https://colab.research.google.com/drive/1b-8JSEMx0Ruqfue9L092sCQxq7-KJaZW?usp=sharing
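To illustrate why hasattr on the wrapper reports False while the value is still reachable, here is a small sketch using stand-in classes. InnerOptimizer and Wrapper are hypothetical names; the only detail taken from this thread is that the wrapped optimizer is stored under the private _optimizer attribute.

```python
# Hypothetical stand-ins that mimic only the attribute layout discussed here.
class InnerOptimizer:
    def __init__(self, weight_decay):
        self.weight_decay = weight_decay


class Wrapper:
    def __init__(self, optimizer):
        # The wrapper keeps the inner optimizer but does not forward
        # its attributes, so they are hidden rather than deleted.
        self._optimizer = optimizer


inner = InnerOptimizer(weight_decay=1e-4)
wrapped = Wrapper(inner)

print(hasattr(wrapped, "weight_decay"))   # False
print(wrapped._optimizer.weight_decay)    # 0.0001
```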

WindQAQ on 15 Jul 2020

Can we close this?

bhack on 29 Aug 2020

👍1

The attribute appears to be available through the private ._optimizer, per the Colab example above. Closing. Please feel free to comment if this is not what you expected.