Addons: Missing argument in apply_gradients() in AdamW optimizer

Created on 9 Mar 2020 · 22Comments · Source: tensorflow/addons

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
TensorFlow version and how it was installed (source or binary): tf-nightly 2.2.0-dev20200309 (pip install)
TensorFlow-Addons version and how it was installed (source or binary): 0.8.3 (pip install)
Python version: 3.6
Is GPU used? (yes/no): yes

Describe the bug

When running model.compile() with AdamW optimizer, a type error is thrown saying:
apply_gradients() got an unexpected keyword argument 'all_reduce_sum_gradients'

This can be fixed by adding in the argument to apply_gradients() in tensorflow_addons/optimizers/weight_decay_optimizers.py

Code to reproduce the issue

https://colab.research.google.com/drive/1A6X8yYii5M8BDqwAFvoFglrTIqWvwQLm

Other info / logs

TypeError: in user code:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:503 train_function  *
    outputs = self.distribute_strategy.experimental_run_v2(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:920 experimental_run_v2  **
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2254 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2615 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:473 train_step  **
    _minimize(tape, self.optimizer, loss, self.trainable_variables)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1737 _minimize
    all_reduce_sum_gradients=False)

TypeError: apply_gradients() got an unexpected keyword argument 'all_reduce_sum_gradients'

TF2.2-Compatibility bug

Source

eoinoreilly30

👍4

Most helpful comment

You code works with tensorflow==2.1.0 and tfa-nightly.

We currently target the stable version of tensorflow (2.1.0) to, as you can see, keep the sanity of our devs. So we don't expect everything to work with tf-nightly.

But what you are reporting is concerning. Either :

the bug comes from tensorflow and they're not being backward compatible there
We use some undocumented/private API in AdamW and we should remove them.

In all cases let's keep this issue open.

gabrieldemarmiesse on 9 Mar 2020

👍5

All 22 comments

You code works with tensorflow==2.1.0 and tfa-nightly.

We currently target the stable version of tensorflow (2.1.0) to, as you can see, keep the sanity of our devs. So we don't expect everything to work with tf-nightly.

But what you are reporting is concerning. Either :

the bug comes from tensorflow and they're not being backward compatible there
We use some undocumented/private API in AdamW and we should remove them.

In all cases let's keep this issue open.

gabrieldemarmiesse on 9 Mar 2020

👍5

@philjd could you look into it? It would be nice to fix it before the 0.9 release.

gabrieldemarmiesse on 3 Apr 2020

Thanks for flagging this. I just checked, the signature of the apply_gradients function has changed indeed. PR #1181 was an attempt to fix this but broke backward compatibility. I'll prepare a PR in a minute.

PhilJd on 3 Apr 2020

Thanks a lot. One thing I don't understand is that if when doing the subclassing, we used only public APIs, the breakage shouldn't happend right?

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer#write_a_customized_optimizer_2

It seems we shouldn't override apply_gradients.

gabrieldemarmiesse on 3 Apr 2020

The problem is that we need to pass a list with the variables to decay. I put this into apply_gradients and not into e.g. constructor as the variables to decay are also passed there.

Atm I don't see a backward compatible way of getting rid of overwriting the apply_gradients function.
I think it's easier to add **kwargs to the function signature, allowing us to be backward compatible.

PhilJd on 3 Apr 2020

Yep you're right. For now, let's fix the compatibility with TF 2.2.x with **kwargs. Then we can look into how to use only public API to reimplement it. If we don't do it at some point, we'll have to update this file at every TF release haha

gabrieldemarmiesse on 3 Apr 2020

Hm, technically we're just wrapping a public API, which seems perfectly valid to me.
I'd argue that the problem is on the Keras side, with an optimizer wrapping the public API in an incompatible way, requiring special treatment in the training loop here.

PhilJd on 3 Apr 2020

We shouldn't override methods which are public and not meant to be overwritten. For example, if method x is public and not meant to be overwritten, and we overwrite it anyway, adding a new argument y, it's working for the moment. Then, since x is a public API, the TF team can also add a new argument z. Then we're in conflict on the API level with what does tensorflow.

But yeah the piece of code you're showing is unfortunate. There are some issues on both side. We can discuss at the meeting today if using composition over subclassing is possible, and if yes, then we may want to go this way and break backward compatibility.

gabrieldemarmiesse on 3 Apr 2020

We should discuss this one for sure.

AakashKumarNain on 3 Apr 2020

Hm, I just realized I took a look at the wrong snapshot for optimizer (2.1) and didn't get the full picture, sorry about that.
I guess we can get a long way with **kwargs (I think it's likely the TF team will only add new arguments with a default, which wouldn't require changes here) but I'd be happy to approach this differently if backward compatibility is less of a concern.

PhilJd on 3 Apr 2020

Yep, so we discussed this at the SIG keras and SIG addons meeting, the decision was that we do quick fixes and keep backward compatibility until the new RFC for keras optimizers is created and implemented. Once it's implemented, we'll refactor everything, possibly do backward incompatible changes and make everything compliant with the new API.

gabrieldemarmiesse on 3 Apr 2020

@PhilJd As said during the sig keras meeting, there is going to be an RFC concerning the new API for optimizers in https://github.com/tensorflow/community/pulls . When it's implemented, we'll likely refactor the weight decay using this API, could you take a look and get involved in this RFC when it's out?

gabrieldemarmiesse on 4 Apr 2020

Sure, feel free to ping me if I happen to miss the announcement :)

PhilJd on 4 Apr 2020

👍3

we still face the issue with TensorFlow version and how it was installed (source or binary): tf-nightly 2.2.0-dev20200408 (pip install) and stable version 2.1

sehamra on 11 Apr 2020

Which tensorflow-addons version are you using? I guess it's in tfa-nightly only so far.

PhilJd on 11 Apr 2020

Should work with tensorflow addons 0.9.x too

gabrieldemarmiesse on 11 Apr 2020

I'm getting a similar error,

    TypeError: tf__apply_gradients() got an unexpected keyword argument 'experimental_aggregate_gradients'

This is from attempting a custom model

from tensorflow.python.keras.mixed_precision.experimental import loss_scale_optimizer as lso
from tensorflow.python.distribute import parameter_server_strategy

def _minimize(strategy, tape, optimizer, loss, trainable_variables):
    with tape:
        if isinstance(optimizer, lso.LossScaleOptimizer):
            loss = optimizer.get_scaled_loss(loss)

    gradients = tape.gradient(loss, trainable_variables)
    gradients = [(ClipIfNotNone(grad)) for grad in gradients]
    gradients = [(ClipIfNotNone2(grad)) for grad in gradients]
    # Whether to aggregate gradients outside of optimizer. This requires support
    # of the optimizer and doesn't work with ParameterServerStrategy and
    # CentralStroageStrategy.
    aggregate_grads_outside_optimizer = (
        optimizer._HAS_AGGREGATE_GRAD and  # pylint: disable=protected-access
        not isinstance(strategy.extended,
                        parameter_server_strategy.ParameterServerStrategyExtended))

    if aggregate_grads_outside_optimizer:
        # We aggregate gradients before unscaling them, in case a subclass of
        # LossScaleOptimizer all-reduces in fp16. All-reducing in fp16 can only be
        # done on scaled gradients, not unscaled gradients, for numeric stability.
        gradients = optimizer._aggregate_gradients(zip(gradients,  # pylint: disable=protected-access
                                                    trainable_variables))
    if isinstance(optimizer, lso.LossScaleOptimizer):
        gradients = optimizer.get_unscaled_gradients(gradients)
    gradients = optimizer._clip_gradients(gradients)  # pylint: disable=protected-access
    if trainable_variables:
        if aggregate_grads_outside_optimizer:
            optimizer.apply_gradients(
                zip(gradients, trainable_variables),
                experimental_aggregate_gradients=False)
        else:
            optimizer.apply_gradients(zip(gradients, trainable_variables))

class CustomModel(tf.keras.Model):
    def train_step(self, data):
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x = data
        y = tf.constant([1.0], dtype=tf.float32)

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            # Compute the loss value
            # (the loss function is configured in `compile()`)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

        _minimize(self.distribute_strategy, tape, self.optimizer, loss,
                self.trainable_variables)

        self.compiled_metrics.update_state(y, y_pred, sample_weight)
        return {m.name: m.result() for m in self.metrics}

This is with the AdamW optimizer as well.

Santosh-Gupta on 28 Jun 2020

👍2

@Santosh-Gupta you need **kwargs. Check https://github.com/tensorflow/addons/pull/1924

bhack on 28 Jun 2020

@bhack where do the **kwargs go? I'm looking at https://github.com/tensorflow/addons/pull/1924 but still not sure. Is it at _minimize? so def _minimize(strategy, tape, optimizer, loss, trainable_variables, **kwargs):?

Santosh-Gupta on 28 Jun 2020

For AdamW https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/weight_decay_optimizers.py#L130-L181

bhack on 28 Jun 2020

👍1

Installing the latest tfa add-ons seemed to fix the issue, though now I'm getting an AttributeError: Tensor.name is meaningless when eager execution is enabled. error, which I believe is related to training on TPUs and not the AdamW optimizer.

Santosh-Gupta on 28 Jun 2020

I solved it, Just upgrade to new version tfa 0.10.0
pip install tensorflow_addons==0.10.0

montaserFath on 29 Jun 2020

👍3

Was this page helpful?

0 / 5 - 0 ratings

Related issues

WeightNormalization data init fails with Keras experimental_run_tf_function

seanpmorgan · 4Comments

tensorflow2.0 can't use this module

ididhmc · 4Comments

Implement Sparsemax Activation and Loss

seanpmorgan · 3Comments

Clean up tutorials

seanpmorgan · 3Comments

AttentionWrapperTest results failing on nightlies

seanpmorgan · 4Comments