I was playing around with different architectures and training techniques and I noticed some odd behavior.
It seems that when we create an SGDW optimizer and then wrap it with either Lookahead or SWA, no error is thrown; the weight_decay attribute is simply truncated from the optimizer.
If you try to wrap an SWA optimizer with Lookahead or vice versa, you'll get an explicit error.
In retrospect it makes sense that these three optimizers are incompatible, as they all operate on weights, and combining them doesn't make sense.
However, I believe it might be a good idea to throw an explicit error that lets us know that a weight decay optimizer is not compatible with the Lookahead or SWA wrapper, instead of just silently removing the weight_decay attribute.
Hi @ben-arnao, you're more than welcome to open a PR for better documentation and error handling! Also, can you provide a minimal runnable code snippet that reproduces what you describe? Thank you for the input.
@WindQAQ I think this should work.
from tensorflow_addons.optimizers import Lookahead, SGDW
opt = SGDW(learning_rate=1e-1, weight_decay=1e-4)
print(hasattr(opt, 'weight_decay'))
opt = Lookahead(opt)
print(hasattr(opt, 'weight_decay'))
Sorry that I rushed to review this yesterday. What do you mean when you say that weight_decay is simply truncated from the optimizer?
Lookahead (and SWA) will silently remove the weight_decay attribute from the optimizer, as the snippet above shows. This makes sense, I guess, because they all operate on weights and we can't do weight decay while also doing Lookahead/SWA.
The issue I had was that no warning or error is thrown if you try to wrap an optimizer that has weight decay (SGDW, for example) with Lookahead/SWA. It's not necessarily a functionality issue, but it would be nice if TF gave some feedback.
Also, I'm not sure if there is something deeper in the structure of these optimizers that I'm not aware of, but if we extend an optimizer with Lookahead and then try to extend it with SWA, we get an explicit error saying that SWA can't extend an optimizer of type Lookahead, or something of that nature. However, if we extend SGD with the decoupled weight decay extension to get SGDW, the base class is still SGD, I guess, so we don't get the same error.
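To make the request concrete, here is a rough sketch of the kind of feedback I have in mind. ToySGDW and ToyWrapper are hypothetical stand-ins written in plain Python, not the actual tensorflow_addons classes, and the warning text is only an illustration of what a wrapper could emit.

```python
import warnings


class ToySGDW:
    """Hypothetical stand-in for an optimizer that carries weight_decay."""

    def __init__(self, learning_rate, weight_decay):
        self.learning_rate = learning_rate
        self.weight_decay = weight_decay


class ToyWrapper:
    """Hypothetical stand-in for a wrapper such as Lookahead or SWA."""

    def __init__(self, optimizer):
        # Give feedback instead of silently hiding the attribute.
        if hasattr(optimizer, "weight_decay"):
            warnings.warn(
                f"{type(optimizer).__name__} defines weight_decay, which is "
                "not exposed on the wrapper.",
                UserWarning,
            )
        # The inner optimizer is kept on a private attribute (assumed name).
        self._optimizer = optimizer


# Wrapping an optimizer that has weight_decay now emits a UserWarning.
wrapped = ToyWrapper(ToySGDW(learning_rate=1e-1, weight_decay=1e-4))
```

Whether this should be a warning or a hard error is a design choice; a warning keeps existing code running while still surfacing the issue.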
Oh, sorry for misunderstanding you. For an optimizer wrapper like Lookahead, it is not truncated; you can still access it like this:
opt._optimizer.weight_decay
https://colab.research.google.com/drive/1b-8JSEMx0Ruqfue9L092sCQxq7-KJaZW?usp=sharing
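If the wrappers may be nested, a small helper can follow the private _optimizer attribute until it finds weight_decay. This is only a sketch under that assumption: Inner and Wrapper are hypothetical plain-Python stand-ins rather than the real tensorflow_addons classes, and find_weight_decay is an illustrative name, not a library function.

```python
class Inner:
    """Hypothetical stand-in for an optimizer such as SGDW."""

    def __init__(self, weight_decay):
        self.weight_decay = weight_decay


class Wrapper:
    """Hypothetical stand-in for a wrapper that stores the inner optimizer."""

    def __init__(self, optimizer):
        self._optimizer = optimizer


def find_weight_decay(optimizer):
    """Return weight_decay from the optimizer or anything it wraps, else None."""
    current = optimizer
    while current is not None:
        if hasattr(current, "weight_decay"):
            return current.weight_decay
        # Step one level down; stops when there is no wrapped optimizer.
        current = getattr(current, "_optimizer", None)
    return None


opt = Wrapper(Inner(weight_decay=1e-4))
print(hasattr(opt, "weight_decay"))  # the wrapper itself does not expose it
print(find_weight_decay(opt))        # the value is still reachable underneath
```

Since _optimizer is a private attribute, code that relies on it may break if the wrapper's internals change.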
Can we close this?
This appears to be available if you access the private ._optimizer attribute, per the Colab example above. Closing. Please feel free to comment if this is not what you expected.