Addons: MovingAverage does not work with MirroredStrategy

Created on 18 Oct 2019 · 16 Comments · Source: tensorflow/addons

System information

OS Platform and Distribution: Ubuntu 18.04
TensorFlow version and how it was installed (source or binary): 2.0 from PyPi
TensorFlow-Addons version and how it was installed (source or binary): 0.5.2 from PyPi
Python version: 3.6

Describe the bug

If I compile a Keras model with a MovingAverage optimizer and use a LearningRateScheduler callback, I get the error 'Optimizer must have a "lr" attribute.' at tensorflow_core/python/keras/callbacks.py:1342. I can work around that with the following code:

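import tensorflow_addons as tfa
# Import path assumed from the TFA 0.5.x source layout.
from tensorflow_addons.utils import keras_utils


@keras_utils.register_keras_custom_object
class LRMovingAverage(tfa.optimizers.MovingAverage):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @property
    def lr(self):
        # Forward `lr` to the wrapped optimizer so the callback can find it.
        return self._optimizer.lr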
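
The workaround relies on the callback only looking for an `lr` attribute on whatever optimizer object the model holds. The sketch below illustrates that pattern in plain Python. `InnerOptimizer`, `Wrapper`, `LRWrapper` and `check_lr` are hypothetical stand-ins written for this illustration, not TensorFlow or TFA classes, and the check is only assumed to resemble the one behind the quoted error message.

```python
class InnerOptimizer:
    """Hypothetical stand-in for an optimizer that exposes `lr`."""

    def __init__(self, lr):
        self.lr = lr


class Wrapper:
    """Hypothetical stand-in for a wrapper optimizer that hides `lr`."""

    def __init__(self, optimizer):
        self._optimizer = optimizer


class LRWrapper(Wrapper):
    """The same wrapper with `lr` forwarded to the wrapped optimizer."""

    @property
    def lr(self):
        return self._optimizer.lr


def check_lr(optimizer):
    # Assumed shape of the callback's check, based on the error text.
    if not hasattr(optimizer, "lr"):
        raise ValueError('Optimizer must have a "lr" attribute.')
    return optimizer.lr


inner = InnerOptimizer(lr=0.001)

try:
    check_lr(Wrapper(inner))
    plain_wrapper_passes = True
except ValueError:
    plain_wrapper_passes = False

forwarded_lr = check_lr(LRWrapper(inner))
print(plain_wrapper_passes, forwarded_lr)
```

A read-only property is enough here as long as the caller only reads the attribute or mutates the object it returns. A caller that assigns to `optimizer.lr` directly would also need a setter.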

However, my model is compiled under tf.distribute.MirroredStrategy().scope() and fit() crashes:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 681, in on_epoch
    yield epoch_logs
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
    per_replica_function, args=(model, x, y, sample_weights))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 661, in _call_for_each_replica
    fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
    coord.join(threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 190, in _call_for_each_replica
    **merge_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 446, in _distributed_apply
    ds_reduce_util.ReduceOp.SUM, grads_and_vars)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1481, in batch_reduce_to
    return self._batch_reduce_to(reduce_op, value_destination_pairs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 707, in _batch_reduce_to
    value_destination_pairs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 317, in batch_reduce
    value_destination_pairs[0][0].values) == 1:
IndexError: list index out of range
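
The last frame indexes `value_destination_pairs[0][0]`, so the message "list index out of range" implies that either the outer list was empty or its first element was an empty list. The traceback does not show which, or why. The sketch below reproduces only the indexing pattern in plain Python. `first_value` and `describe` are hypothetical helpers and the inputs are made-up placeholders, not real gradients or variables.

```python
def first_value(value_destination_pairs):
    # Same indexing pattern as the last frame of the traceback.
    return value_destination_pairs[0][0]


def describe(pairs):
    """Return "ok" or the IndexError text raised by the indexing."""
    try:
        first_value(pairs)
        return "ok"
    except IndexError as exc:
        return f"IndexError: {exc}"


empty_result = describe([])                    # no pairs at all
nonempty_result = describe([("grad", "var")])  # one placeholder pair

print(empty_result)
print(nonempty_result)
```

If the empty-list reading is right, the thing to look for is why no gradient/variable pairs reached the reduce step when the wrapped optimizer ran under the strategy.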

Code to reproduce the issue

TODO

Labels: bug, help wanted, optimizers

vmarkovtsev

All 16 comments

@PhilJd Hi, Phil, could you take a look? Thanks

facaiy on 18 Oct 2019

@vmarkovtsev Hi, Vadim, can you provide a minimal reproducible example? Thank you

facaiy on 18 Oct 2019

@facaiy Sure

#!/usr/bin/env python3
import sys
import tensorflow as tf
import tensorflow_addons as tfa


def main():
    batch_size = 12
    features_shape = 372, 558, 3
    labels = 10
    sample = tf.random.uniform(features_shape)

    def with_shape(t, shape):
        t = tf.squeeze(t)
        t.set_shape(shape)
        return t

    ds_train = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).map(lambda s, l: (with_shape(s, (batch_size,) + features_shape),
                                                      with_shape(l, (batch_size, labels))))
    ds_val = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).take(10).map(
        lambda s, l: (with_shape(s, (batch_size,) + features_shape), with_shape(l, (batch_size, labels))))
    with tf.distribute.MirroredStrategy().scope():
        model = tf.keras.applications.DenseNet121(
            weights=None, input_shape=features_shape, classes=labels)
        model.build((batch_size,) + features_shape)
        model.summary()
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
        optimizer = tfa.optimizers.MovingAverage(optimizer, average_decay=0.9999)
        cross_entropy = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
        model.compile(optimizer=optimizer, loss=cross_entropy, metrics=["accuracy"])
    model.fit(ds_train, validation_data=ds_val, epochs=1, steps_per_epoch=100)


if __name__ == "__main__":
    sys.exit(main())

vmarkovtsev on 18 Oct 2019

cc @dubey @guptapriya - probably this is an upstream problem

vmarkovtsev on 18 Oct 2019

@facaiy: I think @Squadrick might be more familiar with it as he implemented it?

PhilJd on 18 Oct 2019

@vmarkovtsev I can reproduce the error with TFA 0.5.2, but when I build TFA from source and run the script, I don't get an error. Could you test it with the latest TFA version?

Squadrick on 3 Nov 2019

I tried 0.6.0 and it did not work. Then I tried pip install git+https and the build failed with stub.cc: No such file or directory. I cannot install tfa-nightly because it requires tf-nightly which is 2.1.0 and it breaks my world. So nope, I cannot test that myself, sorry.

vmarkovtsev on 4 Nov 2019

@Squadrick I will be happy to install and test a wheel for Python 3.6 if you are able to build it and attach here.

vmarkovtsev on 4 Nov 2019

> I tried 0.6.0 and it did not work. Then I tried pip install git+https and the build failed with stub.cc: No such file or directory. I cannot install tfa-nightly because it requires tf-nightly which is 2.1.0 and it breaks my world. So nope, I cannot test that myself, sorry.

Hi @VladimirStarostenkov, could you try running `pip install tfa-nightly --no-deps` so that there is no requirement for tf-nightly?

seanpmorgan on 4 Nov 2019

@seanpmorgan not sure if that helps, but I was able to reproduce it.

```
vladimir@vladmsi:~/tf-additions$ python3 --version
Python 3.6.8
vladimir@vladmsi:~/tf-additions$ python3 -m venv ./env
vladimir@vladmsi:~/tf-additions$ source env/bin/activate
(env) vladimir@vladmsi:~/tf-additions$ pip install --upgrade pip
...
Successfully installed pip-19.3.1

(env) vladimir@vladmsi:~/tf-additions$ pip install tfa-nightly --no-deps
Collecting tfa-nightly
Downloading https://files.pythonhosted.org/packages/69/1d/782a3dcc8690b76f15f6c3abd7928986848b1d7dcbcf46887209b57f044b/tfa_nightly-0.7.0.dev20191103-cp36-cp36m-manylinux2010_x86_64.whl (1.9MB)
|████████████████████████████████| 1.9MB 1.1MB/s
Installing collected packages: tfa-nightly
Successfully installed tfa-nightly-0.7.0.dev20191103

(env) vladimir@vladmsi:~/tf-additions$ pip install tensorflow
Collecting tensorflow
Downloading https://files.pythonhosted.org/packages/46/0f/7bd55361168bb32796b360ad15a25de6966c9c1beb58a8e30c01c8279862/tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (86.3MB)
|████████████████████████████████| 86.3MB 2.7MB/s
...
ERROR: tfa-nightly 0.7.0.dev20191103 requires tf-nightly, which is not installed.
...
Successfully installed absl-py-0.8.1 astor-0.8.0 cachetools-3.1.1 certifi-2019.9.11 chardet-3.0.4 gast-0.2.2 google-auth-1.6.3 google-auth-oauthlib-0.4.1 google-pasta-0.1.7 grpcio-1.24.3 h5py-2.10.0 idna-2.8 keras-applications-1.0.8 keras-preprocessing-1.1.0 markdown-3.1.1 numpy-1.17.3 oauthlib-3.1.0 opt-einsum-3.1.0 protobuf-3.10.0 pyasn1-0.4.7 pyasn1-modules-0.2.7 requests-2.22.0 requests-oauthlib-1.2.0 rsa-4.0 setuptools-41.6.0 six-1.12.0 tensorboard-2.0.1 tensorflow-2.0.0 tensorflow-estimator-2.0.1 termcolor-1.1.0 urllib3-1.25.6 werkzeug-0.16.0 wheel-0.33.6 wrapt-1.11.2

(env) vladimir@vladmsi:~/tf-additions$ python moving_average.py
2019-11-04 09:54:57.279453: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-04 09:54:57.306281: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-11-04 09:54:57.307197: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4383400 executing computations on platform Host. Devices:
2019-11-04 09:54:57.307212: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:Entity . at 0x7f2dffc44c80> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: expected exactly one node node, found []
WARNING:tensorflow:There is non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
Model: "densenet121"

Layer (type) Output Shape Param # Connected to

...

Total params: 7,047,754
Trainable params: 6,964,106
Non-trainable params: 83,648

Train for 100 steps, validate for 10 steps
1/100 [..............................] - ETA: 4:42
Traceback (most recent call last):
File "moving_average.py", line 37, in <module>
sys.exit(main())
File "moving_average.py", line 33, in main
model.fit(ds_train, validation_data=ds_val, epochs=1, steps_per_epoch=100)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
total_epochs=epochs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
result = self._call(*args, **kwds)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
self._initialize(args, kwds, add_initializers_to=initializer_map)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
*args, **kwds))
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
return weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
per_replica_function, args=(model, x, y, sample_weights))
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 661, in _call_for_each_replica
fn, args, kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 190, in _call_for_each_replica
**merge_kwargs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 446, in _distributed_apply
ds_reduce_util.ReduceOp.SUM, grads_and_vars)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1481, in batch_reduce_to
return self._batch_reduce_to(reduce_op, value_destination_pairs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 707, in _batch_reduce_to
value_destination_pairs)
File "/home/vladimir/tf-additions/env/lib/python3.6/site-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 317, in batch_reduce
value_destination_pairs[0][0].values) == 1:
IndexError: list index out of range
```

If I install `tensorflow` first, the result does not change. The only difference is that I don't get `ERROR: tfa-nightly 0.7.0.dev20191103 requires tf-nightly, which is not installed.`

VladimirStarostenkov on 4 Nov 2019

The code runs fine on my local machine. Pulled the latest master and built it from scratch on Google Colab and it runs without errors as well.

Link: https://colab.research.google.com/drive/17dYDWJJo7vJOAoO6JCSR-BBwlPWH1fKM

Squadrick on 7 Nov 2019

@Squadrick It runs fine on Colab because reproducing this requires multiple devices.

vmarkovtsev on 7 Nov 2019

@vmarkovtsev I was able to recreate the error on Colab with the same hardware as before (no accelerators) and without multiple devices.

I used tensorflow_addons==0.6.0 and tensorflow==2.0.0 instead of the tfa_nightly and tf_nightly.

Link to recreated error: https://colab.research.google.com/drive/1VFzf57e5v6awNi_Y4edFeH4t3GzPFL96

Squadrick on 8 Nov 2019

Great, so since @VladimirStarostenkov reproduced it with tfa_nightly and tensorflow==2.0, I can conclude that upgrading tensorflow to the future 2.1 should fix the problem.

vmarkovtsev on 8 Nov 2019
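
Since the outcome in this thread depended on which TensorFlow and TensorFlow-Addons builds were installed together, it may help to print the installed versions when reporting a result. The following is a minimal sketch using only the standard library. It assumes Python 3.8 or newer for `importlib.metadata`, which is newer than the Python 3.6 used above, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names mentioned in this thread.
packages = ["tensorflow", "tf-nightly", "tensorflow-addons", "tfa-nightly"]
report = installed_versions(packages)

for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```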

@vmarkovtsev Closing this issue, feel free to reopen it if you run into any more problems.

Squadrick on 8 Nov 2019

I will ping here if I have problems because issue authors cannot reopen their issues on GitHub if they were closed by maintainers.

vmarkovtsev on 8 Nov 2019

👍2
