Models: Why is multi-GPU training not supported for object detection?

Created on 18 Oct 2018 · 28 Comments · Source: tensorflow/models

Is there any reason why multi-GPU training is not supported?

research support

All 28 comments

A few questions for you: how do you run training on multiple GPUs? Which parameters need to be modified in train.py?

you can set the parameter num_clones=YOUR_GPU_NUMS
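For reference, a sketch of the full legacy invocation (the paths and GPU count below are placeholders, not from this thread). The command is built as a string and echoed so it can be inspected before launching real training:

```shell
# Sketch only: paths are placeholders; adjust to your setup.
NUM_GPUS=4  # replace with your GPU count
CMD="python object_detection/legacy/train.py \
  --logtostderr \
  --pipeline_config_path=path/to/pipeline.config \
  --train_dir=path/to/train_dir \
  --num_clones=${NUM_GPUS} \
  --ps_tasks=1"
echo "${CMD}"
```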

@roadcode Thanks for your help. It works.

Thanks @roadcode ! Is it possible to use multiple gpus with the new model_main.py?

@wolfshow probably not, you can see this issue #5421

Please see my answer here.

Sorry to hijack this post
Hi @pkulzc ,
I'm using model_main.py for training. It evaluates every 1200 steps, but I want it to evaluate every 50k steps. I tried changing the config file, with no success. Could you please tell me how to make the model evaluate every 50k steps?
My training set contains 700k images,
my validation set contains 300k images,
and there are 100 classes.
Evaluation takes too much time.
Thanks.
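For what it's worth, in the estimator-based model_main.py evaluation is triggered each time a new checkpoint is written, so the eval interval is effectively governed by the checkpointing interval. A sketch, assuming the stock tf.estimator setup (the path and the 50000 value are illustrative):

```python
# Sketch: raising save_checkpoints_steps stretches the eval interval,
# since model_main.py evaluates once per newly written checkpoint.
import tensorflow as tf

run_config = tf.estimator.RunConfig(
    model_dir='path/to/model_dir',  # placeholder path
    save_checkpoints_steps=50000,   # checkpoint (and thus evaluate) every 50k steps
)
```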

@pkulzc why not fix this problem? The new model_main.py doesn't support multi-GPU now that slim is no longer used.

@mlinxiang I didn't understand your question. For now you can only use multi-GPU with train.py. When we migrate to Keras in the future, model_main.py will support multi-GPU as well.

Any progress on this task? Object detection training is of limited use on a single GPU. Legacy train.py seems to no longer work.

This is still ongoing. Moving to Keras is a huge effort, so it'll take some more time.

Really hoping this issue is prioritized highly, no multi-gpu training is a gigantic downside compared to other object detection frameworks! Legacy train.py also seems broken now.

We have made some progress - now we have a few Keras-based models that are working, and we are testing multi-GPU training. We will do a release once they are ready.

Thanks for the update @pkulzc

@pkulzc Hi. How's this going? Can we get access to the keras based models that are working?

great job~~

too slow ~~~

I've been trying the Keras models. With the configs I tried, RetinaNet failed due to misnaming of the layers being targeted, and Faster R-CNN trained successfully (single GPU) but then failed to compile for inference, due to weights being incorrectly named in the checkpoint (box predictor weights getting a '_1' appended to parts of the name).

Do you have configs that work? Ideally for the Keras-feature-extractor flavour of RetinaNet on multi-GPU?

is multi-gpu training supported now?

any updates?

Multi-GPU training of Faster/Mask R-CNN is supported in https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN, open-sourced for 2+ years and able to reproduce the results in the papers.

The following uses only 1 GPU; with legacy/train.py, multiple GPUs can be used.

python object_detection/model_main.py \
  --train_dir=./models/chk_union_model_mobilenetv3_small_4gpu/ \
  --train_distribute=true \
  --worker_replicas=4 \
  --num_clones=4 \
  --ps_tasks=1 \
  --pipeline_config_path=object_detection/chk_union_data/aigc_ssdlite_mobilenet_v3_small_320x320_coco_4gpu.config \
  --model_dir=models/chk_union_model_mobilenetv3_small_4gpu \
  --num_train_steps=200000 \
  --sample_1_of_n_eval_examples=1 \
  --alsologtostderr

https://github.com/tensorflow/models/issues/6611

The following can use 4 GPUs, with sync_replicas: false in the pipeline config and batch_size = per-GPU batch x 4:

python object_detection/legacy/train.py \
  --train_dir=./models/chk_union_model_mobilenetv3_small_4gpu/ \
  --train_distribute=true \
  --worker_replicas=4 \
  --num_clones=4 \
  --ps_tasks=1 \
  --pipeline_config_path=object_detection/chk_union_data/aigc_ssdlite_mobilenet_v3_small_320x320_coco_4gpu.config \
  --model_dir=models/chk_union_model_mobilenetv3_small_4gpu \
  --num_train_steps=200000 \
  --sample_1_of_n_eval_examples=1 \
  --alsologtostderr
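For completeness, the corresponding pipeline-config fragment might look like this (the batch_size value is illustrative: a per-GPU batch of 32 across 4 GPUs):

```
train_config {
  batch_size: 128       # per-GPU batch (32) x 4 GPUs
  sync_replicas: false
}
```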

which branch did you use?

master

Looks like I am able to start multi-gpu training by setting num_workers to >1 in models/research/object_detection/model_main_tf2.py and nvidia-smi does show utilization. Will update soon about inference.

Btw, a checkpoint created this way can only be used with multi-GPU training. If I set num_workers to 1, I get the same error as here: #8892
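For anyone following along, the invocation would look roughly like this (the paths below are placeholders, and --num_workers is the flag in model_main_tf2.py mentioned above). The command is echoed rather than launched so it can be checked first:

```shell
# Sketch only: placeholder paths; assumes model_main_tf2.py from master.
NUM_WORKERS=2  # >1 enables the multi-worker/multi-GPU path
CMD="python object_detection/model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --num_workers=${NUM_WORKERS} \
  --alsologtostderr"
echo "${CMD}"
```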

@aabbas90 what changes are required in the config file?

No changes in the config file are needed for multi-GPU training. If single-GPU training is working fine for you, then only the steps I mentioned above are required.
