Addons: Make API for custom optimizer wrappers more consistent

Created on 2 Oct 2020  路  18Comments  路  Source: tensorflow/addons

There are several optimizer classes which wrap an existing optimizer and add functionality, but they have slightly different APIs The Lookahead, MovingAverage, and SWA each take an optimizer in its constructor and adds additional functionality. Similarly, LossScaleOptimizer from Keras, an optimizer which can prevent numeric underflow by scaling loss and gradients, takes an optimizer in its constructor as well. On the other hand, the extend_with_decoupled_weight_decay function instead creates a new dynamically-created class which subclasses from the original optimizer's class but adds functionality. I think we should make the API for all these classes consistent, so that all optimizer wrappers have the same API.

LossScaleOptimizer is being made non-experimental in TF 2.4, so I need to decide on the final API soon. Ideally, it would be consistent with the optimizer APIs in TF-Addons, which is why I am bringing up this issue. There is a mixed precision RFC describing the proposed changes (no need to read it, I will summarize everything important here).

I summarize the two approaches an optimizer extension can take below:

1. Continue wrapping optimizers with another optimizer class:

This approach continues having optimizers like Lookahead and LossScaleOptimizer wrap other optimizers. An OptimizerWrapper helper class can be created, which automatically deals with wrapping logic such as implementing get_config properly so each optimizer wrapper can focus on just the added functionality.

One question is how to whether __getattribute__ and __setattr__ should be delegate hyperparameters (and potentially other attributes) to the inner optimizer, which could be done in OptimizerWrapper. I proposed doing this in the mixed precision RFC, but can change it. Doing this delegation would allow attributes of the inner optimizer to be accessed Lookahead, LossScaleOptimizer, and other wrappers:

opt = tf.keras.optimizers.SGD(0.1, momentum=0.1)
opt = tfa.optimizers.Lookahead(opt)
...  # Do some training
print(opt.momemtum)
opt.momentum = 0.05

Delegating all attributes in __getattribute__ is tempting, but it does have a significant flaw: it can cause the custom logic in optimizer wrappers not to be called:

class MyCustomOptimizer(tf.keras.optimizers.SGD):

  def apply_gradients_zero_min(self, grads_and_vars):
    grads_and_vars = [(tf.nn.relu(g), v) for g, v in grads_and_vars]
    self.apply_gradients(grads_and_vars)

opt = MyCustomOptimizer(1.)
opt = tf.keras.mixed_precision.LossScaleOptimizer(opt)
opt.apply_gradients_zero_min(...)  # Bug! Does not call LossScaleOptimizer version of apply_gradients.

Custom logic in the function LossScaleOptimizer.apply_gradients is not called, as only the apply_gradients of the inner optimizer is called. Similar issues occur with any optimizer wrapper.

2. Dynamically create a subclass

This approach is used by extend_with_decoupled_weight_decay. It consists of dynamically creating a subclass with two superclasses: The original optimizer (e.g. Adam) and a special extension class which adds the extra optimizer functionality. For example, see the implementation of extend_with_decoupled_weight_decay. Other optimizers could use the same approach.

We want to allow users to, e.g., create an Adam optimizer then pass it to a function which returns an optimizer of the new type. To allow this, we could have a function which takes in the old optimizer, creates the new class and returns an optimizer of the new class, e.g.:

opt = tf.keras.optimizers.SGD(0.1, momentum=0.1)
opt = tfa.optimizers.copy_with_decoupled_weight_decay(opt)

In this case, copy_with_decoupled_weight_decay would create the new optimizer class then use opt.get_config to serialize the old optimizer and from_config to deserialize it into the new optimizer. One flaw is this requires the optimzier to be serializable with get_config and from_config Another flaw is that users might accidentally modify the old optimizer, not realizing that this does not modify the new optimizer.

Alternatively, we could have a function monkey-patch opt.__class__ to the new optimizer class, so the optimizer does not need to be serializable with get_config and from_config:

opt = tf.keras.optimizers.SGD(0.1, momentum=0.1)
# Monkey-patches opt.__class__ to an optimizer class which subclasses 
# both DecoupledWeightDecayExtension and SGD
tfa.optimizers.modify_class_to_use_decoupled_weight_decay(opt)

Which to choose?

Does anyone have any opinions on which approach is superior, and whether all the optimizers should be unified to use the same approach? I think I prefer dynamically creating a subclass because it completely emulates the API of the original optimizer's class.

/CC @cyberzhg, @bhack, @squadrick, @shreyashpatodia, @philjd, who have worked on optimizers in TF-Addons
/CC @omalleyt12, @tomerk, who have worked on Keras optimizers

discussion needed optimizers

Most helpful comment

All 18 comments

I don't have a strong opinion on this, as you stated, both approaches have their pros and cons. Note though that I haven't worked with Keras lately, so I'm not familiar with the config/scheduler interface and might miss important points.

I'm guilty of introducing the dynamic subclass approach, to keep the API consistent by having an optimizer class that can be instantiated like all standard optimizers. This comes at the cost of somewhat reduced readability due to multiple inheritance. Since then the optimizer logic has become more complex with initialization from config/schedulers etc., making this harder to maintain and read, so I'm very open to an alternative.
However, if going with 1, it's probably safest to allow custom optimizers to overwrite the wrapper classes *apply* functions only to prevent the problem you mentioned and have the optimizer fail loudly otherwise. I don't know if this would be flexible enough to cover all use cases. Unfortunately, the effort to clean this up (tensorflow/community/pull/234) and make the API consistent is on hold.

One thing I'm trying to understand is the need for creating the optimizer from an existing instance. What's the advantage over creating a wrapped optimizer class and directly instantiating this class?
I guess when wrapping an instance, (1) is the more intuitive approach, while (2) works better for the latter.

I think the dynamic subclass approach is pretty clean compared to wrapping an optimizer, as it completely emulates the interface of the original optimizer. It's a bit hard to read, but I think it will be easy to maintain. The DecoupledWeightDecayExtension.from_config method can be modified to call super().from_config, which I think would then make it work even if the base Optimizer changes its from_config implementation.

tensorflow/community/pull/234 will likely be approved soon and the transform_gradients and aggregate_gradients constructor arguments have been implemented (they have been renamed to gradient_transformers and gradient_aggregator respectively). However, the implementation is not flexible enough to support the existing optimizer wrappers in Addons and Keras.

One thing I'm trying to understand is the need for creating the optimizer from an existing instance. What's the advantage over creating a wrapped optimizer class and directly instantiating this class?
I guess when wrapping an instance, (1) is the more intuitive approach, while (2) works better for the latter.

I think it's a better API to convert an existing optimizer, especially when a model supports specify which optimizer to use based on a flag. Compare the two approaches for using decoupled weight decay:

# Approach 1: Use existing API
if FLAGS.optimizer == 'sgd':
  opt_type = tf.keras.optimizers.SGD
  kwargs = {'learning_rate': 0.1}
else:
  opt_type = tf.keras.optimziers.Adam(0.1, beta_1=0.99, beta_2=0.999)
  kwargs = {'learning_rate': 0.1, 'beta_1': 0.99, 'beta_2': 0.999}

if FLAGS.use_dwd:
  opt_type = tfa.optimizers.extend_with_decoupled_weight_decay(opt_type)
  kwargs['weight_decay'] = 0.1

opt = opt_type(**kwargs)
# Approach 2: API which uses existing optimizer instance
if FLAGS.optimizer == 'sgd':
  opt = tf.keras.optimizers.SGD(learning_rate=0.1)
else:
  opt = tf.keras.optimizers.Adam(learning_rate=0.1, beta_1=0.99, beta_2=0.990)

opt = tfa.optimizers.copy_with_decoupled_weight_decay(opt, weight_decay=0.1)

Approach 2 is both shorter and clearer.

What about the Multioptimizer for the dynamic approach case? https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/discriminative_layer_training.py

I didn't know about Multioptimizer until now. Multioptimizer will not work with the dynamic approach case. However, it also doesn't fully work with my proposed optimizer wrapper case, as we cannot delegate __getattribute__ (because there are multiple inner optimizers so we don't know which to delegate to). I think we should treat Multioptimizer as a special case that doesn't have to have the same API as other optimizer wrappers. Most other wrappers extend the functionality of an optimizer in some way, but Multioptimizer instead is an API to use different optimizers for different sets of weights

Cannot we make an hypotesis about extending copy_with_decoupled_weight_decay to handle also the input types that already handle the Multioptimizer?
It is a little bit true that It could be a special case but at the end other then "special inputs" we just overide/wrap apply_gradients. So anyhting so special.

By "special inputs" do you mean to have copy_with_decoupled_weight_decay also allow a list of (optimizer, layers) pairs to be passed, like as MultiOptimizer.__init__, instead of the single optimizer? The issue with doing that is that copy_with_decoupled_weight_decay would create a dynamic subclass that subclasses from the original optimizer's class. But if there are multiple original optimizers, there is no single class to subclass from.

For "special inputs" I meant that but if it is really a "very special case" cannot we dynaically subclass to Multioptimizer directly? Or we could have other more general use cases for this kind of inputs?

Multioptimizer cannot be implemented as a dynamic subclass. Since what would its superclasses be? We have multiple original optimizers for Multioptimizer, but we cannot subclass all of them, just one.

I don't think its worth having a general use case for wrapping multiple optimizers, as in Multioptimizer, since it isn't very common. There are many wrapper-like classes that wrap just one optimizer (Lookahead, extend_with_decoupled_weight_decay, LossScaleOptimizer) but just one that wraps multiple optimizers (Multioptimizer). Also, Multioptimizer is unlike the other wrappers in other way as well: it does not actually modify the logic in how gradients are applied to variables, but just directly delegates to one of the inner optimizers for each weight.

For the second part ok It could be just an evaluation about that multioptimizer will be really the only case to handle this kind of inputs or that It Is just different cause It Is only a router on optimizers/layers.

But for the first part probably I don't understand what copy_with_decoupled_weight_decay will return for other optimizers?

I think that the base class for Multioptimizer is optimizer.
But if we don't have other use cases in mind for the layer/optimizer input other then Multioptimizer "wrapper/routing" activity It could be plausibile to maintain it as Is.

I think its feasible to maintain Multioptimizer as is. If we wish to implement more optimizers that wrap multiple optimizers, we can create a generic MultiOptimizerWrapper that could be used for all of them.

But for the first part probably I don't understand what copy_with_decoupled_weight_decay will return for other optimizers?

copy_with_decoupled_weight_decay would roughly look like this (the example uses DecoupledWeightDecayExtension which is defined here):

def copy_with_decoupled_weight_decay(optimizer, weight_decay):
  class OptimizerWithDecoupledWeightDecay(DecoupledWeightDecayExtension, optimizer.__class__):
    pass

  config = optimizer.get_config()
  config['weight_decay'] = weight_decay
  return OptimizerWithDecoupledWeightDecay.from_config(config)    

copy_with_decoupled_weight_decay copies the optimizer with get_config and from_config, but the new optimizer also subclasses from DecoupledWeightDecayExtension to get the extra functionality.

Hi @reedwm, thanks for bringing this up. I would prefer option 2. BTW, do you have the timeline that the new loss scale optimizer API will land in OSS TensorFlow? We usually pin to 2.4rc few days after it releases, so we want to have some time to prepare in advance. If you don't mind, we can discuss or even reveal the designated API here. Thank you!

Gentle ping to @tensorflow/sig-addons-maintainers because of API change. We should also update the related documentation as well.

Also, I want to notice that because this is a huge API change, and combined with https://github.com/tensorflow/addons/issues/2122, we can discuss whether we should bump our major version to 1.0.0 to inform users that there are no more backward compatibility.

The 2.4 branch is being cut in about two weeks. At that point, we cannot make API changes to TensorFlow anymore, but there will still be a few weeks after that before it is released.

If we do end up creating a helper class like OptimizerWrapper, it will likely not make it to TF 2.4 but instead TF 2.5. Which means TF Addons won't be able to use it until 2.5 is released. But the OptimizerWrapper can always be implemented in Addons first. In any case, I don't think we need to rush to make any API changes to Addons, but we do need to hurry in deciding the TensorFlow LossScaleOptimizer API as the TF 2.4 branch is being cut soon.

/CC @omalleyt12 any thoughts between options 1 and 2?

Thanks for opening this discussion!

My preference is for option (1). Conceptually, what these optimizers do is "Delegate most logic to a user-provided optimizer (of any class), while adding additional functionality". Composition seems like the best pattern for this.

IMO, option (2) falls too much into the "magic" realm. A few issues I think we'd run into with approach (2) off the top of my head:

  1. What happens to the original optimizer when a user does make_dynamic_subclass(my_optimizer)?

If there is no connection b/t the state of the my_optimizer instance and the new instance, this will be surprising to users. In that case, why not use the class name rather than an instance? But if we use the class name, then it becomes a pain to pass parameters to the dynamic subclass

  1. Two optimizers of the same "dynamic class" won't have the same __class__, which will break isinstance checks

  2. Extending the dynamic subclasses will be more difficult. Rather than subclassing the optimizer, the user will have to make their own function that creates a dynamic subclass, which seems a lot more difficult

Re option (1), composition of Layers is a fairly established pattern in Keras. For instance, the TimeDistributed and RNN layers use this pattern.

IMO when we have the use case "I want to delegate some functionality of my object to an object of type X, but I don't care what subclass of X is used," this is a classic use case for composition

Also, in context of Model.fit, where we auto-wrap the user's optimizer with LossScalingOptimizer, only option (1) would work.

Consider this scenario:

opt = SGD(1e-3)
model.compile(opt, ...)
model.fit(...)
model.compile(opt, ...)
model.fit(...)

If we created a dynamic subclass in compile, then any changes to the learning_rate that occurred during the first training run would not be reflected in the second training run, even though the user would expect that they would

Hmm you are right the auto-wrapping with LossScaleOptimizer works very poorly with model.compile, especially if model.compile is called multiple times. I think we would have to remove the autowrapping functionality if we go with option (2). I personally think we should remove the autowrapping functionality regardless, but many people disagree so I kept that functionality in the mixed precision RFC.

I prefer (2) since then users can access normal non-hyperparameter attributes on the wrapped optimizer. E.g.:

opt = tf.keras.optimizers.SGD()
opt = create_dynamic_subclass(opt)
opt.nesterov = True  # Only works if (2) is used. For (1), no error is thrown but nesterov is still not used

If there is no connection b/t the state of the my_optimizer instance and the new instance, this will be surprising to users

This is surprising but I doubt most users would notice in practice. As long as they stop using the old optimizer by doing something like opt = make_dynamic_subclass(opt), then they won't notice.

Two optimizers of the same "dynamic class" won't have the same __class__, which will break isinstance checks

We can cache the dynamic subclass, so if two optimziers of the same type are passed to make_dynamic_subclass, the two returned optimizers also have the same type.

Extending the dynamic subclasses will be more difficult. Rather than subclassing the optimizer, the user will have to make their own function that creates a dynamic subclass, which seems a lot more difficult

This is difficult, but users don't have to create their own function that creates a dynamic subclass. They can do something like

class MySubclass(make_dynamic_subclass(tf.keras.optimizers.SGD()).__class__):
  ... # define class here

This is ugly, but I don't think users will want to extend the dynamic subclass frequently.

IMO when we have the use case "I want to delegate some functionality of my object to an object of type X, but I don't care what subclass of X is used," this is a classic use case for composition

I don't think this is a classic use case for composition since users may access attributes of the subclass of X that are not defined on the type X itself. For example, users may access the nesterov attribute only defined on SGD. AFAIK, there is no precedent for either the composition approach or the dynamic subclass approach if subclasses expose extra attributes.

This is surprising but I doubt most users would notice in practice. As long as they stop using the old optimizer by doing something like opt = make_dynamic_subclass(opt), then they won't notice.

Can we try to limit this only to "not used" optimizer?

class MySubclass(make_dynamic_subclass(tf.keras.optimizers.SGD()).__class__):
... # define class here

This Is very ugly.. I hope It Is not a real use case.

So from what I can tell, the goal here is to make optimizer wrappers better by addressing two issues.

  1. When minimizing, the optimizer wrapper's method or optimizer's method is not called
  2. When serializing and de-serializing, the wrapped optimizer must be restored correctly.

I think we can look at an alternative approach by standardizing what optimizer wrappers are allowed to do.

From what I can tell, an optimizer wrapper does one or more of the following things.

  1. Pre gradient application variable modification
    a) Filtering of variables (gradients) before application (multioptimizer, weight decay optimizer)
    b) Modification of variables (gradients) before application (average optimizer, lookahead)
  2. Pre or Post gradient application variable storage
    a) Storage and maintenance of weights (moving average)
    b) Storage and maintenance of variables necessary for pre gradient application functionality (most wrappers)
  3. Post gradient application variable modification
    a) Modification of weights (stochastic weight averaging)

I believe the commonality among all optimizer wrappers is that they do not modify the method for applying the gradients. As in, every wrapper wants to preserve the wrapped optimizer's behaviors, while implementing additional behaviours pre and post application of gradients.

With this in mind, I propose that we create an optimizer wrapper class that stores the original optimizer in a standardized manner to address the two issues above.

First, we standardize how wrappers wrap optimizers. Each wrapper implements apply_gradients that calls the wrapped optimizer's apply gradients. All wrapper behavior happens before and/or after the the wrapped optimizer's apply gradients step. Effectively, every optimizer wrapper will have this method defined.

def apply_gradients(kwargs): 
      # I believe post apply of step 1 can be performed pre apply of step 2, unless its literally the last step of the training run
      modded_kwargs = self._pre_apply_stuff(kwargs)
      return self.opt.apply_gradients(modded_kwargs)

Second, we always instantiate the optimizer, then pass it to the optimizer wrapper. The wrapper must implement a nested serialize and deserialize (like in multi optimizer). This allows for both the wrapper and original optimizer to be serialized and deserialized without creating new subclasses on the fly.

Third, we shouldn't allow nor design for manual non apply_gradient method calls to optimizers and wrappers. An optimizer is intended to work in a fit loop that constantly calls apply_gradients. There may be conditionals on how gradients are applied, but at every step of training, apply_gradients is called. If there are behaviors that occur every x steps or on some conditional, that should be implemented inside the apply_gradients method and variables should be used to track when to call those behaviors, when the apply_gradients method is called. (I believe dense and sparse apply are called as part of the apply_gradients method, but if you are modifying behavior within the apply gradient method then you aren't building a wrapper. It doesn't "wrap" a function if it is modifying things within the function.)

In short, I believe that the class wrapping approach is more readable, easier to maintain, and less vulnerable to changes in base TF. If we all agree that optimizers are used in loops that call their apply method every step, then we can standardize how optimizer wrappers behave to prevent any occurrence of an missed method call. Finally, serialization is necessary for saving models and many other TF functions. The wrapper class approach addresses all the issues so long as we all agree on how optimizers are supposed to work.

P.S.

I think we can somehow delegate attribute calls to the original optimizer or make it clear that if you want to assign attributes to the optimizer, assign it to the optimizer and not the optimizer wrapper.

Also, if we use method 2 of subclassing, we will never be able to wrap twice. This is because subclassing overrides the apply_gradients method so doing it twice will give you the most recent wrapper.

If we use method 1, we can wrap twice, because wrapper 1 calls apply gradients of wrapper 2 that calls apply gradients of the optimizer. Each wrapper does its pre/post apply behaviors before passing it on the to next.

So... How can we get the final conclusion here?

Was this page helpful?
0 / 5 - 0 ratings