Models: [Feature request] Documentation for Object Detection API GPU parameters

Created on 17 Jul 2018 · 4 Comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: /tensorflow/models/research/object_detection/
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): conda tensorflow-gpu
  • TensorFlow version (use command below): 1.8.0
  • Bazel version (if compiling from source): not compiled from source
  • CUDA/cuDNN version: 9.0/7?
  • GPU model and memory: (4) GTX 1080 TI
  • Exact command to reproduce: Feature request

Describe the problem

Normally, to train object detection models on all 4 GPUs in my single server, I pass --num_clones=4 --ps_tasks=1 to train.py. However, with a few models (e.g. NasNet, RetinaNet), setting num_clones above 1 fails with ValueError: ('In Synchronous SGD mode num_clones must ', 'be 1. Found num_clones: 4'). I tried every combination I could think of, including the other parameter worker_replicas and different batch_size values, but nothing I do uses more than the first GPU.
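
For readers puzzling over the tuple-shaped message: a minimal sketch of the kind of guard that would produce it. This is reconstructed from the error text alone and is not copied from the library; the function name check_clone_settings is hypothetical, and the condition is an assumption inferred from the wording "In Synchronous SGD mode".

```python
def check_clone_settings(sync_replicas: bool, num_clones: int) -> None:
    """Hypothetical guard, inferred from the error text in this issue."""
    if sync_replicas and num_clones != 1:
        # Passing two positional arguments to ValueError makes str(error)
        # render as a tuple, which matches the shape of the reported message.
        raise ValueError('In Synchronous SGD mode num_clones must ',
                         'be 1. Found num_clones: {}'.format(num_clones))


if __name__ == '__main__':
    try:
        check_clone_settings(sync_replicas=True, num_clones=4)
    except ValueError as error:
        print('ValueError:', error)
```

If the real check has this shape, the number of clones is only rejected while synchronous mode is on, which is consistent with the error naming num_clones rather than the GPU count.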

I can't find any documentation on how to set this up, so it would be really helpful to include some for beginners.

EDIT: I just noticed that train.py is now in legacy/, so I can no longer use multiple GPUs for _any_ model, not just the ones listed. I created a new issue for this.

research support

All 4 comments

If you set sync_replicas: false in your config, this problem should go away. Please try it out and let us know.
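
A minimal sketch of applying that change to a pipeline config with plain text editing, in case it helps. It assumes the field lives inside the train_config block and that the config uses the usual "name: value" text layout; the helper name disable_sync_replicas and the sample config are hypothetical, and the sketch does not use the Object Detection API itself.

```python
import re


def disable_sync_replicas(config_text: str) -> str:
    """Return config_text with sync_replicas set to false.

    Assumption: sync_replicas belongs in the train_config block.
    """
    if re.search(r'\bsync_replicas\s*:', config_text):
        # Field already present: overwrite its value.
        return re.sub(r'(\bsync_replicas\s*:\s*)\w+', r'\g<1>false', config_text)
    # Field absent: add it just after the opening of train_config.
    # If there is no train_config block, the text is returned unchanged.
    return re.sub(r'(train_config\s*:?\s*\{)',
                  r'\g<1>\n  sync_replicas: false',
                  config_text, count=1)


# Hypothetical config excerpt, for illustration only.
SAMPLE = """train_config: {
  batch_size: 64
  sync_replicas: true
}
"""

if __name__ == '__main__':
    print(disable_sync_replicas(SAMPLE))
```

Editing the file by hand works just as well; the point is only that the setting goes in the config file rather than on the train.py command line.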

Okay, thanks. I will try this out on Monday.

@tombstone This works for me, thanks (using the train.py in legacy/, however). Could you explain what turning off sync_replicas does? I can't seem to find any documentation anywhere. (I hope it doesn't just mean that two separate models are trained side by side.)

Hi there,
We are checking to see if you still need help with this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help with this issue any more, please consider closing it.
