Mask_rcnn: Training is slow

Created on 5 Dec 2017 · 15 comments · Source: matterport/Mask_RCNN

Hi @waleedka, when I am training the model with 8 GPUs on a single machine, the speed slows down as training progresses. Any suggestions to improve my situation? How long does it take you to finish the training process on a P100? Many thanks.

Most helpful comment

Code is in https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks. But there aren't many high-level docs. Some are only briefly mentioned here although the content might be a bit outdated.

All 15 comments

One possible reason is if your GPUs are faster than the processes that load and prepare the data. Try increasing the number of workers for the data generator and see if that improves the situation.
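
For reference, here is a minimal sketch of how the worker count could be raised if you call Keras's fit_generator yourself (in this repo the call lives inside model.py's train(); the generator name and the numbers below are illustrative, not the repo's exact code):

```python
# Sketch only: more loader workers and a bigger queue so data preparation
# can keep up with the GPUs. Names below (keras_model, train_generator,
# config, epochs) are placeholders for your own objects.
import multiprocessing

keras_model.fit_generator(
    train_generator,
    steps_per_epoch=config.STEPS_PER_EPOCH,
    epochs=epochs,
    max_queue_size=100,                      # buffer more prepared batches
    workers=multiprocessing.cpu_count(),     # parallel data-loading workers
    use_multiprocessing=True,
)
```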

Another possibility: if you're doing multiple stages of training and each stage trains a different number of layers, then the more layers you include in training, the more time it takes.

I haven't tried on P100, so it's hard to tell.

Thanks for the reply. I tried adjusting the number of workers, but it didn't help. I looked into fit_generator() and found that it reads data smoothly, so training the model itself seems to be the bottleneck. Specifically, I noticed that training with 5-8 GPUs is very slow, but with 1-4 GPUs it is fast (which is odd). Could that be a problem with "parallel_model"? Thank you.

That's odd. It's hard to guess what the issue might be without digging deeper into it. The parallel_model uses the same logic regardless of the number of GPUs, so I don't think an issue there would explain why the speed decreases once you hit 5 GPUs. My guess is that you're hitting some type of limit, like a memory limit causing disk swapping, or a bandwidth limit communicating between cards, or something along those lines. Can you post more details and numbers?

I am using P100s (with PCIe). I guess GPU memory should not be a problem, and host memory is also sufficiently large. When monitoring nvidia-smi, I noticed that for quite long periods the GPU utilization of all 8 GPUs goes down to 0. During these periods, CPU utilization is between 10% and 30%, and there is no obvious hard disk IO (monitored with iostat). I am wondering what the machine is doing during that time.

Also, when I am training with two M40 GPUs, the training process gets slower and slower.

Keras itself already adds some overhead. parallel_model.py keeps only one copy of the parameters on a single device, which can lead to high communication costs. The best known strategies to scale to multiple GPUs are in the official TF benchmark. With those strategies, a Mask R-CNN should get over 90% GPU utilization on 8 P100s.

the GPU utilization of all 8 GPUs goes down to 0. During this period, CPU utilization is between 10% and 30%, and there is no obvious hard disk IO

That seems to imply that either it's waiting for data transfer to complete (for example, copying data from CPU to GPUs or vice versa), or it's stuck waiting for an OP to finish before it can start the next batch.

  • Are you using mini-masks or the full-size masks? Full-size masks can get really big if your images are big (see the config sketch after this list).
  • You can probably try training on the Shapes dataset and see if you encounter the same problem. The shapes dataset doesn't need to load anything from disk, so it would help rule out the possibility of a delay in data loading.
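
If full-size masks turn out to be the bottleneck, here is a minimal config sketch for switching to mini-masks (the attribute names follow this repo's config.py; the class name is illustrative, and older checkouts import with `from config import Config`):

```python
# Sketch only: mini-masks store each instance mask cropped and resized to a
# small fixed shape instead of at full image resolution.
from mrcnn.config import Config  # older versions: from config import Config

class MyConfig(Config):
    NAME = "my_dataset"
    USE_MINI_MASK = True          # keep masks small in memory
    MINI_MASK_SHAPE = (56, 56)    # (height, width) of stored mini-masks
```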

As for why it gets slower with time, I cannot think of any reason why that would happen. The only thing that gets bigger with time that I can think of is the events file. Are you using a standard file system, or something special that might get slower the bigger the files get?

Thanks guys. I think I have fixed the problem.

@jiang1st hey, how did you fix the problem? I think I encountered a similar problem, and it's really odd that training is even slower when using multiple GPUs (compared to a single GPU).

@ppwwyyxx

The best known strategies to scale to multiple GPUs are in the official TF benchmark.

What are the strategies exactly? I can't seem to find them other than some reference parameters.

Code is in https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks. But there aren't many high-level docs. Some are only briefly mentioned here although the content might be a bit outdated.


Hi guys, how did you fix this problem? I ran into the same one: GPU utilization is almost 0, but GPU memory is fully loaded. Thank you in advance.

I have a similar issue:
My own dataset (small... 154 images)... Masks are just the bounding boxes.

GPU utilization is 0-9%, mostly around 1%. I suspect the time goes into opening the images; I know it's loading the mask for each image, because I put a debugging print in the dataset's load_mask.

Memory is at 80-95% of GPU (GTX 1070, so nothing special there)
Using TensorFlow 1.12 GPU, CUDA 9.0, Keras 2.1.6
Batch size is 1 (1 image/GPU because of memory limits on GPU)
CPU Utilization is 100% on 4 cores (all pegged on python).
Training, as a result, is extremely slow...

I've tried wrapping all model building and training in a with tf.device('/gpu:0'): block and setting os.environ["CUDA_VISIBLE_DEVICES"]... But the process still uses the CPU rather than the GPU for processing.
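
One thing worth checking, shown as a quick hedged sketch below: CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow is imported, and tf.test.is_gpu_available() tells you whether TF sees the GPU at all (if it returns False, the CUDA/cuDNN install is the problem rather than this repo):

```python
# Sketch only: set the environment variable before importing TensorFlow,
# then verify that TF can actually see a GPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0

import tensorflow as tf
print(tf.test.is_gpu_available())          # should print True
```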

@eselkin, have you tried using a dictionary to cache the loaded images and masks? This can be done easily by overloading the load_image and load_mask functions. When I did this, the GPU utilisation bumped up significantly.
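
For anyone who wants to try this, here is a rough sketch of such a cache (the subclass and base-class names are placeholders; it assumes the whole dataset fits in RAM, which is fine for a few hundred images):

```python
# Sketch only: memoize decoded images and masks so each is read from disk once.
class CachedDataset(MyDataset):          # MyDataset = your utils.Dataset subclass
    def __init__(self, *args, **kwargs):
        super(CachedDataset, self).__init__(*args, **kwargs)
        self._image_cache = {}
        self._mask_cache = {}

    def load_image(self, image_id):
        if image_id not in self._image_cache:
            self._image_cache[image_id] = super(CachedDataset, self).load_image(image_id)
        return self._image_cache[image_id]

    def load_mask(self, image_id):
        # load_mask returns (masks, class_ids); the tuple is cached as-is
        if image_id not in self._mask_cache:
            self._mask_cache[image_id] = super(CachedDataset, self).load_mask(image_id)
        return self._mask_cache[image_id]
```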

Hi @eselkin
I am facing the same issue. Did you find a solution? It has increased the training time significantly.
