Models: Why is multi-GPU training not supported for object detection?

Created on 18 Oct 2018 · 28 Comments · Source: tensorflow/models

Is there any reason why multi-GPU training is not supported?

research support

All 28 comments

A few questions for you: how do you run training on multiple GPUs? Which parameters need to be modified in train.py?

you can set the parameter num_clones=YOUR_GPU_NUMS
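For reference, a sketch of the full legacy invocation (the paths and GPU count below are placeholders, not from this thread). The command is built as a string and echoed so it can be inspected before launching real training:

```shell
# Sketch only: paths are placeholders; adjust to your setup.
NUM_GPUS=4  # replace with your GPU count
CMD="python object_detection/legacy/train.py \
  --logtostderr \
  --pipeline_config_path=path/to/pipeline.config \
  --train_dir=path/to/train_dir \
  --num_clones=${NUM_GPUS} \
  --ps_tasks=1"
echo "${CMD}"
```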

@roadcode Thanks for your help. It works.

Thanks @roadcode ! Is it possible to use multiple gpus with the new model_main.py?

@wolfshow probably not, you can see this issue #5421

Please see my answer here.

Sorry to hijack this post
Hi @pkulzc ,
I'm using model_main.py for training. It evaluates every 1200 steps, but I want it to evaluate every 50k steps. I tried changing the config file, with no success. Could you please tell me how to make the model evaluate every 50k steps?
My training set contains 700k images,
my validation set contains 300k images,
and there are 100 classes.
Evaluation takes too much time.
Thanks.
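For what it's worth, in the estimator-based model_main.py evaluation is triggered each time a new checkpoint is written, so the eval interval is effectively governed by the checkpointing interval. A sketch, assuming the stock tf.estimator setup (the path and the 50000 value are illustrative):

```python
# Sketch: raising save_checkpoints_steps stretches the eval interval,
# since model_main.py evaluates once per newly written checkpoint.
import tensorflow as tf

run_config = tf.estimator.RunConfig(
    model_dir='path/to/model_dir',  # placeholder path
    save_checkpoints_steps=50000,   # checkpoint (and thus evaluate) every 50k steps
)
```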

@pkulzc why not fix this problem? The new model_main.py doesn't support multi-GPU now that slim is no longer used.

@mlinxiang I didn't understand your question. For now you can only use multi-GPU with train.py. When we migrate to Keras in the future, model_main.py will support multi-GPU as well.

Any progress on this task? Object detection training is of limited use on a single GPU. Legacy train.py seems to no longer work.

This is still ongoing. Moving to Keras is a huge effort, so it'll take some more time.

Really hoping this issue is prioritized highly, no multi-gpu training is a gigantic downside compared to other object detection frameworks! Legacy train.py also seems broken now.

We have made some progress - now we have a few Keras-based models that are working, and we are testing multi-GPU training. We will do a release once they are ready.

Thanks for the update @pkulzc

@pkulzc Hi. How's this going? Can we get access to the keras based models that are working?

great job~~

too slow ~~~

I've been trying the Keras models. With the configs I tried, RetinaNet failed due to misnaming of the layers being targeted, and Faster R-CNN trained successfully (single GPU) but then failed to compile for inference, due to weights being incorrectly named in the checkpoint (box predictor weights getting a '_1' appended to parts of the name).

Do you have configs that work? Ideally for the Keras-feature-extractor flavour of RetinaNet on multi-GPU?

is multi-gpu training supported now?

any updates?

Multi-GPU training of Faster/Mask R-CNN is supported in https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN, open-sourced for 2+ years and able to reproduce the results in the papers.

The following uses only 1 GPU; with legacy/train.py, multiple GPUs can be used.

python object_detection/model_main.py \
  --train_dir=./models/chk_union_model_mobilenetv3_small_4gpu/ \
  --train_distribute=true \
  --worker_replicas=4 \
  --num_clones=4 \
  --ps_tasks=1 \
  --pipeline_config_path=object_detection/chk_union_data/aigc_ssdlite_mobilenet_v3_small_320x320_coco_4gpu.config \
  --model_dir=models/chk_union_model_mobilenetv3_small_4gpu \
  --num_train_steps=200000 \
  --sample_1_of_n_eval_examples=1 \
  --alsologtostderr

https://github.com/tensorflow/models/issues/6611

The following can use 4 GPUs, with sync_replicas: false in the pipeline config and batch_size = per-GPU batch x 4:

python object_detection/legacy/train.py \
  --train_dir=./models/chk_union_model_mobilenetv3_small_4gpu/ \
  --train_distribute=true \
  --worker_replicas=4 \
  --num_clones=4 \
  --ps_tasks=1 \
  --pipeline_config_path=object_detection/chk_union_data/aigc_ssdlite_mobilenet_v3_small_320x320_coco_4gpu.config \
  --model_dir=models/chk_union_model_mobilenetv3_small_4gpu \
  --num_train_steps=200000 \
  --sample_1_of_n_eval_examples=1 \
  --alsologtostderr
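For completeness, the corresponding pipeline-config fragment might look like this (the batch_size value is illustrative: a per-GPU batch of 32 across 4 GPUs):

```
train_config {
  batch_size: 128       # per-GPU batch (32) x 4 GPUs
  sync_replicas: false
}
```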

which branch did you use?

master

Looks like I am able to start multi-gpu training by setting num_workers to >1 in models/research/object_detection/model_main_tf2.py and nvidia-smi does show utilization. Will update soon about inference.

Btw, a checkpoint created this way can only be used with multi-GPU training. If I set num_workers to 1, I get the same error as here: #8892
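For anyone following along, the invocation would look roughly like this (the paths below are placeholders, and --num_workers is the flag in model_main_tf2.py mentioned above). The command is echoed rather than launched so it can be checked first:

```shell
# Sketch only: placeholder paths; assumes model_main_tf2.py from master.
NUM_WORKERS=2  # >1 enables the multi-worker/multi-GPU path
CMD="python object_detection/model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --num_workers=${NUM_WORKERS} \
  --alsologtostderr"
echo "${CMD}"
```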

@aabbas90 what changes are required in the config file?

No changes in the config file are needed for multi-GPU training. If single-GPU training is working fine for you, then only the steps I mentioned above are required.
