Addons: MovingAverage does not work with MirroredStrategy

Created on 18 Oct 2019  路  16Comments  路  Source: tensorflow/addons

System information

  • OS Platform and Distribution: Ubuntu 18.04
  • TensorFlow version and how it was installed (source or binary): 2.0 from PyPi
  • TensorFlow-Addons version and how it was installed (source or binary): 0.5.2 from PyPi
  • Python version: 3.6

Describe the bug

If I compile a Keras model with a MovingAverage optimizer and a LearningRateScheduler, I get an error "Optimizer must have a "lr" attribute." at tensorflow_core/python/keras/callbacks.py:1342. I can fix that by the following code:

@keras_utils.register_keras_custom_object
class LRMovingAverage(tfa.optimizers.MovingAverage):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @property
    def lr(self):
        return self._optimizer.lr

However, my model is compiled under tf.distribute.MirroredStrategy().scope() and I crash in fit():

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 681, in on_epoch
    yield epoch_logs
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
    per_replica_function, args=(model, x, y, sample_weights))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 661, in _call_for_each_replica
    fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
    coord.join(threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 190, in _call_for_each_replica
    **merge_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 446, in_distributed_apply
    ds_reduce_util.ReduceOp.SUM, grads_and_vars)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1481, in batch_reduce_to
    return self._batch_reduce_to(reduce_op, value_destination_pairs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 707, in _batch_reduce_to
    value_destination_pairs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 317, in batch_reduce
    value_destination_pairs[0][0].values) == 1:
IndexError: list index out of range

Code to reproduce the issue

TODO

bug help wanted optimizers

Most helpful comment

I will ping here if I have problems because issue authors cannot reopen their issues on GitHub if they were closed by maintainers.

All 16 comments

@PhilJd Hi, Phil, could you take a look? Thanks

@vmarkovtsev Hi, Vadim, can you provide a minimal reproducible example? Thank you

@facaiy Sure

#!/usr/bin/env python3
import sys
import tensorflow as tf
import tensorflow_addons as tfa


def main():
    batch_size = 12
    features_shape = 372, 558, 3
    labels = 10
    sample = tf.random.uniform(features_shape)

    def with_shape(t, shape):
        t = tf.squeeze(t)
        t.set_shape(shape)
        return t

    ds_train = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).map(lambda s, l: (with_shape(s, (batch_size,) + features_shape),
                                                      with_shape(l, (batch_size, labels))))
    ds_val = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).take(10).map(
        lambda s, l: (with_shape(s, (batch_size,) + features_shape), with_shape(l, (batch_size, labels))))
    with tf.distribute.MirroredStrategy().scope():
        model = tf.keras.applications.DenseNet121(
            weights=None, input_shape=features_shape, classes=labels)
        model.build((batch_size,) + features_shape)
        model.summary()
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
        optimizer = tfa.optimizers.MovingAverage(optimizer, average_decay=0.9999)
        cross_entropy = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
        model.compile(optimizer=optimizer, loss=cross_entropy, metrics=["accuracy"])
    model.fit(ds_train, validation_data=ds_val, epochs=1, steps_per_epoch=100)


if __name__ == "__main__":
    sys.exit(main())

cc @dubey @guptapriya - probably this is an upstream problem

@facaiy: I think @Squadrick might be more familiar with it as he implemented it?

@vmarkovtsev I get the error of TFA 0.5.2 but building and running TFA doesn't give me an error. Could you test it with the latest TFA version?

I tried 0.6.0 and it did not work. Then I tried pip install git+https and the build failed with stub.cc: No such file or directory. I cannot install tfa-nightly because it requires tf-nightly which is 2.1.0 and it breaks my world. So nope, I cannot test that myself, sorry.

@Squadrick I will be happy to install and test a wheel for Python 3.6 if you are able to build it and attach here.

I tried 0.6.0 and it did not work. Then I tried pip install git+https and the build failed with stub.cc: No such file or directory. I cannot install tfa-nightly because it requires tf-nightly which is 2.1.0 and it breaks my world. So nope, I cannot test that myself, sorry.

Hi @VladimirStarostenkov could you try installing pip install tfa-nightly --no-deps so there is no requirement for tf-nightly?

@seanpmorgan not sure if that helps, but I was able to reproduce it.

```vladimir@vladmsi:~/tf-additions$ python3 --version
Python 3.6.8
vladimir@vladmsi:~/tf-additions$ python3 -m venv ./env
vladimir@vladmsi:~/tf-additions$ source env/bin/activate
(env) vladimir@vladmsi:~/tf-additions$ pip install --upgrade pip
...
Successfully installed pip-19.3.1


(env) vladimir@vladmsi:~/tf-additions$ pip install tfa-nightly --no-deps
Collecting tfa-nightly
Downloading https://files.pythonhosted.org/packages/69/1d/782a3dcc8690b76f15f6c3abd7928986848b1d7dcbcf46887209b57f044b/tfa_nightly-0.7.0.dev20191103-cp36-cp36m-manylinux2010_x86_64.whl (1.9MB)
|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 1.9MB 1.1MB/s
Installing collected packages: tfa-nightly
Successfully installed tfa-nightly-0.7.0.dev20191103


(env) vladimir@vladmsi:~/tf-additions$ pip install tensorflow
Collecting tensorflow
Downloading https://files.pythonhosted.org/packages/46/0f/7bd55361168bb32796b360ad15a25de6966c9c1beb58a8e30c01c8279862/tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (86.3MB)
|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 86.3MB 2.7MB/s
...
ERROR: tfa-nightly 0.7.0.dev20191103 requires tf-nightly, which is not installed.
...
Successfully installed absl-py-0.8.1 astor-0.8.0 cachetools-3.1.1 certifi-2019.9.11 chardet-3.0.4 gast-0.2.2 google-auth-1.6.3 google-auth-oauthlib-0.4.1 google-pasta-0.1.7 grpcio-1.24.3 h5py-2.10.0 idna-2.8 keras-applications-1.0.8 keras-preprocessing-1.1.0 markdown-3.1.1 numpy-1.17.3 oauthlib-3.1.0 opt-einsum-3.1.0 protobuf-3.10.0 pyasn1-0.4.7 pyasn1-modules-0.2.7 requests-2.22.0 requests-oauthlib-1.2.0 rsa-4.0 setuptools-41.6.0 six-1.12.0 tensorboard-2.0.1 tensorflow-2.0.0 tensorflow-estimator-2.0.1 termcolor-1.1.0 urllib3-1.25.6 werkzeug-0.16.0 wheel-0.33.6 wrapt-1.11.2


(env) vladimir@vladmsi:~/tf-additions$ python moving_average.py
2019-11-04 09:54:57.279453: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-04 09:54:57.306281: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-11-04 09:54:57.307197: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4383400 executing computations on platform Host. Devices:
2019-11-04 09:54:57.307212: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:Entity . at 0x7f2dffc44c80> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: expected exactly one node node, found []
WARNING:tensorflow:There is non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
Model: "densenet121"


Layer (type) Output Shape Param # Connected to

...

Total params: 7,047,754
Trainable params: 6,964,106
Non-trainable params: 83,648


Train for 100 steps, validate for 10 steps
1/100 [..............................] - ETA: 4:42Traceback (most recent call last):
File "moving_average.py", line 37, in
sys.exit(main())
File "moving_average.py", line 33, in main
model.fit(ds_train, validation_data=ds_val, epochs=1, steps_per_epoch=100)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
total_epochs=epochs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
result = self._call(args, *kwds)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
self._initialize(args, kwds, add_initializers_to=initializer_map)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
args, *kwds))
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
func_outputs = python_func(func_args, *func_kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
return weak_wrapped_fn().__wrapped__(args, *kwds)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
per_replica_function, args=(model, x, y, sample_weights))
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 661, in _call_for_each_replica
fn, args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(self._exc_info_to_raise)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 190, in _call_for_each_replica
*
merge_kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 446, in _distributed_apply
ds_reduce_util.ReduceOp.SUM, grads_and_vars)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1481, in batch_reduce_to
return self._batch_reduce_to(reduce_op, value_destination_pairs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 707, in _batch_reduce_to
value_destination_pairs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 317, in batch_reduce
value_destination_pairs[0][0].values) == 1:
IndexError: list index out of range
`` If I installtensorflowfirst, the result does not change. The only difference is, I don't getERROR: tfa-nightly 0.7.0.dev20191103 requires tf-nightly, which is not installed.`

The code runs fine on my local machine. Pulled the latest master and built it from scratch on Google Colab and it runs without errors as well.

Link: https://colab.research.google.com/drive/17dYDWJJo7vJOAoO6JCSR-BBwlPWH1fKM

@Squadrick Colab works because this requires multiple devices.

@vmarkovtsev I was able to recreate the error on Colab with the same hardware as before (no accelerators) and no multiple devices.

I used tensorflow_addons==0.6.0 and tensorflow==2.0.0 instead of the tfa_nightly and tf_nightly.

Link to recreated error: https://colab.research.google.com/drive/1VFzf57e5v6awNi_Y4edFeH4t3GzPFL96

Great, so since @VladimirStarostenkov reproduced it with tfa_nightly and tensorflow==2.0, I can conclude that upgrading tensorflow to the future 2.1 should fix the problem.

@vmarkovtsev Closing this issue, feel free to reopen it if you run into any more problems.

I will ping here if I have problems because issue authors cannot reopen their issues on GitHub if they were closed by maintainers.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ididhmc picture ididhmc  路  4Comments

seanpmorgan picture seanpmorgan  路  3Comments

seanpmorgan picture seanpmorgan  路  4Comments

seanpmorgan picture seanpmorgan  路  4Comments

seanpmorgan picture seanpmorgan  路  4Comments