Models: Estimators for training: Multi GPU Support seems missing

Created on 15 Jul 2018 · 15 comments · Source: tensorflow/models

Feature Request:

Estimators seem to make data parallelism much easier with the replicate_model_fn and TowerOptimizer decorators. These don't seem to be used in the Estimator definitions in model_lib.py.

Could multi-GPU usage be clarified (if it is already supported)?

For my present use case, I happen to be modifying the model_lib.py definition with the decorators to accommodate tower cloning.


_System Information doesn't seem relevant to this, but included nevertheless_

System information

  • What is the top-level directory of the model you are using: model/research
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes, but none of it depends on the Estimator definitions.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux, 16.04 (ubuntu)
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.8.0
  • Bazel version (if compiling from source): 0.14.0
  • CUDA/cuDNN version: 8.0/6.0
  • GPU model and memory: 1080 Ti, 12 GB
  • Exact command to reproduce: python object_detection/model_main.py
awaiting maintainer


All 15 comments

The recommended API for parallelizing estimators is Distribution Strategies.

https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/distribute

and for examples in official models you can follow:

https://github.com/tensorflow/models/blob/master/official/utils/misc/distribution_utils.py
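The helper in that file essentially maps a GPU count to a strategy. A hedged sketch of the pattern (written with the current tf.distribute names; at the time of this thread these classes lived under tf.contrib.distribute):

```python
import tensorflow as tf

def get_distribution_strategy(num_gpus):
    """Pick a distribution strategy based on the number of GPUs available.

    Sketch of the pattern used in official/utils/misc/distribution_utils.py.
    """
    if num_gpus == 0:
        # Everything runs on a single CPU device.
        return tf.distribute.OneDeviceStrategy("/cpu:0")
    if num_gpus == 1:
        # Everything runs on a single GPU device.
        return tf.distribute.OneDeviceStrategy("/gpu:0")
    # MirroredStrategy replicates the model on each GPU of one machine
    # and keeps the replicas in sync.
    return tf.distribute.MirroredStrategy()

strategy = get_distribution_strategy(num_gpus=0)
```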

I thought Estimators were meant to make scaling much easier than using the tf.contrib.distribute API directly. Does that API work with Estimators too?

And are these (whether decorators or function wrappers) baked into the Object Detection API (akin to the previous train script based on slim.training, which allowed multiple clones)?

Yes, you simply pass a distribution strategy to tf.estimator.RunConfig(), and the Estimator handles the rest. Currently only OneDeviceStrategy (single CPU or GPU) and MirroredStrategy (multi-GPU, single node) are implemented, but more are in development. official/resnet, official/wide_deep, and official/transformer all use this API, so you can check them for details.

It doesn't appear that research/object detection uses DistributionStrategies right now.
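In code, the pattern described above might look like this. A hedged sketch: the `train_distribute` argument landed in tf.estimator.RunConfig around TF 1.9, the strategy classes lived under tf.contrib.distribute in 1.x, and `my_model_fn` is a placeholder:

```python
import tensorflow as tf

def my_model_fn(features, labels, mode):
    """Placeholder Estimator model_fn; the real one builds the model graph."""
    ...

# MirroredStrategy does in-graph replication across the GPUs of one machine.
strategy = tf.distribute.MirroredStrategy()

# Passing the strategy via RunConfig is all the wiring the Estimator needs.
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)
```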

Thanks! That's helpful.

Keeping this open, however, as a feature request for the API.

@varun19299 if the changes are simple, would you mind sharing your modifications to support multi-gpu training?

@robieta so does this mean the update to estimator-based object detection effectively removed multi GPU support?

No, it just means that that isn't how they are implementing multi-gpu support.

Hmm, when switching to estimator-based training there no longer seems to be an option to select the number of GPUs with the new model_main.py, as there was with legacy/train.py.

@pkulzc

@varun19299 The distribution strategies that @robieta mentioned currently do not work with models constructed using tf.contrib.slim layers, and all models in the TensorFlow Object Detection API use tf.contrib.slim.

We are evaluating changing model construction to be based on tf.layers or tf.keras after which we should be able to support all distribution strategies.

For now we only support

  1. TPU training via model_tpu_main.py
  2. Multi-worker asynchronous GPU training via model_main.py

It would be great if these two could be clarified:

  • @tombstone if I'm not wrong, tf.contrib.slim has moved its layers to tf.contrib.layers and slim.arg_scope to tf.contrib.framework.arg_scope. These do work with Estimators (I've tried the decorators, similar to the MNIST example at models/examples).

Also, going by issue #16182 on tensorflow/tensorflow, I thought there were plans in the works to shift to these two or other better-supported APIs (slim was supposed to be deprecated soon).

  • Regarding

Multi-worker asynchronous GPU training via model_main.py

could the same be shown as an example? I'm not sure the current model_main.py supports clone-based data parallelism (I'm not sure which distribution strategy slim uses, but I assume it is asynchronous; certainly not as broad as what Estimators offer).

Could you please clarify these? Thanks a lot!

It would be great if these two could be clarified:

@tombstone if I'm not wrong, tf.contrib.slim has moved its layers to tf.contrib.layers and slim.arg_scope to tf.contrib.framework.arg_scope. These do work with Estimators (I've tried the decorators, similar to the MNIST example at models/examples).

You are right, but it is the new distribution strategies in Estimators that don't work well with tf.contrib.layers or tf.contrib.slim.

Also, going by issue #16182 on tensorflow/tensorflow, I thought there were plans in works to shift to these two or better supported APIs. (slim was supposed to be deprecated soon).

Regarding
Multi-worker asynchronous GPU training via model_main.py

could the same be shown as an example? I'm not sure the current model_main.py supports clone-based data parallelism (I'm not sure which distribution strategy slim uses, but I assume it is asynchronous; certainly not as broad as what Estimators offer).
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md shows an example of running multi-worker asynchronous jobs.
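Concretely, multi-worker asynchronous Estimator training is coordinated through the TF_CONFIG environment variable, which each process in the cluster sets before starting. A hedged sketch (the host names and ports below are placeholders):

```python
import json
import os

# Every process in the cluster shares the same cluster spec: the parameter
# servers hold the variables, the workers compute gradients asynchronously.
cluster = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

def tf_config_for(task_type, task_index):
    """Build the TF_CONFIG value for one process: shared cluster + own task."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# e.g. set this in the environment before launching model_main.py
# on the first worker; the Estimator reads it at startup.
os.environ["TF_CONFIG"] = tf_config_for("worker", 0)
```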

If you were using the clone mechanism for single worker multi-gpu training before, please continue to use legacy/train.py. It should work.

Could you please clarify these? Thanks a lot!

You are right. But it is the new distribution strategies in estimators that don't work well with tf.contrib.layers or tf.contrib.slim

That's interesting. Any particular reason why? (It would be great if you could explain a bit about how the backend for these decorators works; that's quite a dark area for me.)

I had a partially similar question. Thanks for asking this @varun19299

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you no longer need help on this issue, please consider closing it.

