/tensorflow/models/research/object_detection/

Normally, to train on all 4 GPUs in my single server with object detection models, I pass --num_clones=4 --ps_tasks=1 when calling train.py. However, with a few models (e.g. NasNet, RetinaNet), setting num_clones greater than 1 fails with ValueError: ('In Synchronous SGD mode num_clones must ', 'be 1. Found num_clones: 4'). I tried every combination I could think of, including the other parameter worker_replicas and modifying batch_size, but nothing I do utilizes more than the first GPU.
I can't find any documentation for how to set this up, so this would be really helpful to include for beginners.
EDIT: I just noticed that train.py is now in legacy/, so I can no longer use multi-GPU training for _any_ model, not just the ones listed. I created a new issue for this.
If you set sync_replicas: false in your config, this problem should go away. Please try it out and let us know.
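For anyone looking for where this goes, here is a minimal sketch, assuming a standard pipeline config in which sync_replicas is a field of the train_config block. The batch_size value is only a placeholder, and all other fields are assumed to stay as they are in your own config.

```
train_config: {
  # Placeholder value; keep whatever your config already uses.
  batch_size: 8
  # Turn off synchronous SGD so that num_clones greater than 1 is accepted.
  sync_replicas: false
  # All other train_config fields stay unchanged.
}
```

With this in place, the same train.py invocation with --num_clones=4 --ps_tasks=1 should no longer raise the ValueError quoted above.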
Okay thanks will try this out on Monday.
@tombstone This works for me, thanks (using the train.py in legacy/, however). Could you explain what turning off sync_replicas does? I can't seem to find any documentation anywhere. (I hope it doesn't just mean that two separate models are trained side by side.)
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.