Addons: Missing argument in apply_gradients() in AdamW optimizer

Created on 9 Mar 2020  路  22Comments  路  Source: tensorflow/addons

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow version and how it was installed (source or binary): tf-nightly 2.2.0-dev20200309 (pip install)
  • TensorFlow-Addons version and how it was installed (source or binary): 0.8.3 (pip install)
  • Python version: 3.6
  • Is GPU used? (yes/no): yes

Describe the bug

When running model.compile() with AdamW optimizer, a type error is thrown saying:
apply_gradients() got an unexpected keyword argument 'all_reduce_sum_gradients'

This can be fixed by adding in the argument to apply_gradients() in tensorflow_addons/optimizers/weight_decay_optimizers.py

Code to reproduce the issue

https://colab.research.google.com/drive/1A6X8yYii5M8BDqwAFvoFglrTIqWvwQLm

Other info / logs

TypeError: in user code:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:503 train_function  *
    outputs = self.distribute_strategy.experimental_run_v2(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:920 experimental_run_v2  **
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2254 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2615 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:473 train_step  **
    _minimize(tape, self.optimizer, loss, self.trainable_variables)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1737 _minimize
    all_reduce_sum_gradients=False)

TypeError: apply_gradients() got an unexpected keyword argument 'all_reduce_sum_gradients'
TF2.2-Compatibility bug

Most helpful comment

You code works with tensorflow==2.1.0 and tfa-nightly.

We currently target the stable version of tensorflow (2.1.0) to, as you can see, keep the sanity of our devs. So we don't expect everything to work with tf-nightly.

But what you are reporting is concerning. Either :

  • the bug comes from tensorflow and they're not being backward compatible there
  • We use some undocumented/private API in AdamW and we should remove them.

In all cases let's keep this issue open.

All 22 comments

You code works with tensorflow==2.1.0 and tfa-nightly.

We currently target the stable version of tensorflow (2.1.0) to, as you can see, keep the sanity of our devs. So we don't expect everything to work with tf-nightly.

But what you are reporting is concerning. Either :

  • the bug comes from tensorflow and they're not being backward compatible there
  • We use some undocumented/private API in AdamW and we should remove them.

In all cases let's keep this issue open.

@philjd could you look into it? It would be nice to fix it before the 0.9 release.

Thanks for flagging this. I just checked, the signature of the apply_gradients function has changed indeed. PR #1181 was an attempt to fix this but broke backward compatibility. I'll prepare a PR in a minute.

Thanks a lot. One thing I don't understand is that if when doing the subclassing, we used only public APIs, the breakage shouldn't happend right?

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer#write_a_customized_optimizer_2

It seems we shouldn't override apply_gradients.

The problem is that we need to pass a list with the variables to decay. I put this into apply_gradients and not into e.g. constructor as the variables to decay are also passed there.

Atm I don't see a backward compatible way of getting rid of overwriting the apply_gradients function.
I think it's easier to add **kwargs to the function signature, allowing us to be backward compatible.

Yep you're right. For now, let's fix the compatibility with TF 2.2.x with **kwargs. Then we can look into how to use only public API to reimplement it. If we don't do it at some point, we'll have to update this file at every TF release haha

Hm, technically we're just wrapping a public API, which seems perfectly valid to me.
I'd argue that the problem is on the Keras side, with an optimizer wrapping the public API in an incompatible way, requiring special treatment in the training loop here.

We shouldn't override methods which are public and not meant to be overwritten. For example, if method x is public and not meant to be overwritten, and we overwrite it anyway, adding a new argument y, it's working for the moment. Then, since x is a public API, the TF team can also add a new argument z. Then we're in conflict on the API level with what does tensorflow.

But yeah the piece of code you're showing is unfortunate. There are some issues on both side. We can discuss at the meeting today if using composition over subclassing is possible, and if yes, then we may want to go this way and break backward compatibility.

We should discuss this one for sure.

Hm, I just realized I took a look at the wrong snapshot for optimizer (2.1) and didn't get the full picture, sorry about that.
I guess we can get a long way with **kwargs (I think it's likely the TF team will only add new arguments with a default, which wouldn't require changes here) but I'd be happy to approach this differently if backward compatibility is less of a concern.

Yep, so we discussed this at the SIG keras and SIG addons meeting, the decision was that we do quick fixes and keep backward compatibility until the new RFC for keras optimizers is created and implemented. Once it's implemented, we'll refactor everything, possibly do backward incompatible changes and make everything compliant with the new API.

@PhilJd As said during the sig keras meeting, there is going to be an RFC concerning the new API for optimizers in https://github.com/tensorflow/community/pulls . When it's implemented, we'll likely refactor the weight decay using this API, could you take a look and get involved in this RFC when it's out?

Sure, feel free to ping me if I happen to miss the announcement :)

we still face the issue with TensorFlow version and how it was installed (source or binary): tf-nightly 2.2.0-dev20200408 (pip install) and stable version 2.1

Which tensorflow-addons version are you using? I guess it's in tfa-nightly only so far.

Should work with tensorflow addons 0.9.x too

I'm getting a similar error,

    TypeError: tf__apply_gradients() got an unexpected keyword argument 'experimental_aggregate_gradients'

This is from attempting a custom model

from tensorflow.python.keras.mixed_precision.experimental import loss_scale_optimizer as lso
from tensorflow.python.distribute import parameter_server_strategy

def _minimize(strategy, tape, optimizer, loss, trainable_variables):
    with tape:
        if isinstance(optimizer, lso.LossScaleOptimizer):
            loss = optimizer.get_scaled_loss(loss)

    gradients = tape.gradient(loss, trainable_variables)
    gradients = [(ClipIfNotNone(grad)) for grad in gradients]
    gradients = [(ClipIfNotNone2(grad)) for grad in gradients]
    # Whether to aggregate gradients outside of optimizer. This requires support
    # of the optimizer and doesn't work with ParameterServerStrategy and
    # CentralStroageStrategy.
    aggregate_grads_outside_optimizer = (
        optimizer._HAS_AGGREGATE_GRAD and  # pylint: disable=protected-access
        not isinstance(strategy.extended,
                        parameter_server_strategy.ParameterServerStrategyExtended))

    if aggregate_grads_outside_optimizer:
        # We aggregate gradients before unscaling them, in case a subclass of
        # LossScaleOptimizer all-reduces in fp16. All-reducing in fp16 can only be
        # done on scaled gradients, not unscaled gradients, for numeric stability.
        gradients = optimizer._aggregate_gradients(zip(gradients,  # pylint: disable=protected-access
                                                    trainable_variables))
    if isinstance(optimizer, lso.LossScaleOptimizer):
        gradients = optimizer.get_unscaled_gradients(gradients)
    gradients = optimizer._clip_gradients(gradients)  # pylint: disable=protected-access
    if trainable_variables:
        if aggregate_grads_outside_optimizer:
            optimizer.apply_gradients(
                zip(gradients, trainable_variables),
                experimental_aggregate_gradients=False)
        else:
            optimizer.apply_gradients(zip(gradients, trainable_variables))

class CustomModel(tf.keras.Model):
    def train_step(self, data):
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x = data
        y = tf.constant([1.0], dtype=tf.float32)

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            # Compute the loss value
            # (the loss function is configured in `compile()`)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

        _minimize(self.distribute_strategy, tape, self.optimizer, loss,
                self.trainable_variables)

        self.compiled_metrics.update_state(y, y_pred, sample_weight)
        return {m.name: m.result() for m in self.metrics}

This is with the AdamW optimizer as well.

@Santosh-Gupta you need **kwargs. Check https://github.com/tensorflow/addons/pull/1924

@bhack where do the **kwargs go? I'm looking at https://github.com/tensorflow/addons/pull/1924 but still not sure. Is it at _minimize? so def _minimize(strategy, tape, optimizer, loss, trainable_variables, **kwargs):?

Installing the latest tfa add-ons seemed to fix the issue, though now I'm getting an AttributeError: Tensor.name is meaningless when eager execution is enabled. error, which I believe is related to training on TPUs and not the AdamW optimizer.

I solved it, Just upgrade to new version tfa 0.10.0
pip install tensorflow_addons==0.10.0

Was this page helpful?
0 / 5 - 0 ratings

Related issues

seanpmorgan picture seanpmorgan  路  4Comments

ididhmc picture ididhmc  路  4Comments

seanpmorgan picture seanpmorgan  路  3Comments

seanpmorgan picture seanpmorgan  路  3Comments

seanpmorgan picture seanpmorgan  路  4Comments