Mask_rcnn: Training is slow

Created on 5 Dec 2017 · 15 comments · Source: matterport/Mask_RCNN

Hi @waleedka, when I am training the model with 8 GPUs on a single machine, the speed slows down as training progresses. Any suggestions to improve my situation? How long does it take you to finish the training process on a P100? Many thanks.

Most helpful comment

Code is in https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks. But there aren't many high-level docs. Some are only briefly mentioned here although the content might be a bit outdated.

All 15 comments

One possible reason is if your GPUs are faster than the processes that load and prepare the data. Try increasing the number of workers for the data generator and see if that improves the situation.
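
For reference, here is a minimal sketch of how the worker count could be raised if you call Keras's fit_generator yourself (in this repo the call lives inside model.py's train(); the generator name and the numbers below are illustrative, not the repo's exact code):

```python
# Sketch only: more loader workers and a bigger queue so data preparation
# can keep up with the GPUs. Names below (keras_model, train_generator,
# config, epochs) are placeholders for your own objects.
import multiprocessing

keras_model.fit_generator(
    train_generator,
    steps_per_epoch=config.STEPS_PER_EPOCH,
    epochs=epochs,
    max_queue_size=100,                      # buffer more prepared batches
    workers=multiprocessing.cpu_count(),     # parallel data-loading workers
    use_multiprocessing=True,
)
```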

Another possibility: if you're doing multiple stages of training and each stage trains a different number of layers, then the more layers you include in training, the more time it takes.

I haven't tried on P100, so it's hard to tell.

Thanks for the reply. I tried adjusting the number of workers, but it didn't help. I looked into fit_generator() and found that it reads data smoothly, so training the model itself seems to be the bottleneck. Specifically, I noticed that training with 5-8 GPUs is very slow, but with 1-4 GPUs it is fast (which is odd). Could that be a problem with "parallel_model"? Thank you.

That's odd. It's hard to guess what the issue might be without digging deeper into it. The parallel_model uses the same logic regardless of the number of GPUs, so I don't think an issue there would explain why the speed decreases once you hit 5 GPUs. My guess is that you're hitting some type of limit, like a memory limit causing disk swapping, or a bandwidth limit communicating between cards, or something along those lines. Can you post more details and numbers?

I am using P100s (with PCIe). I guess GPU memory should not be a problem, and host memory is also sufficiently large. When monitoring nvidia-smi, I noticed that for quite long periods the GPU utilization of all 8 GPUs goes down to 0. During these periods, CPU utilization is between 10% and 30%, and there is no obvious hard disk IO (monitored with iostat). I am wondering what the machine is doing during that time.

Also, when I am training with two M40 GPUs, the training process gets slower and slower.

Keras itself already adds some overhead. parallel_model.py keeps only one copy of the parameters on a single device, which can lead to high communication costs. The best known strategies to scale to multiple GPUs are in the official TF benchmark. With those strategies, a Mask R-CNN should get over 90% GPU utilization on 8 P100s.

the GPU utilization of all 8 GPUs goes down to 0. During this period, CPU utilization is between 10% and 30%, and there is no obvious hard disk IO

That seems to imply that either it's waiting for data transfer to complete (for example, copying data from CPU to GPUs or vice versa), or it's stuck waiting for an OP to finish before it can start the next batch.

  • Are you using mini-masks or the full-size masks? Full-size masks can get really big if your images are big (see the config sketch after this list).
  • You can probably try training on the Shapes dataset and see if you encounter the same problem. The shapes dataset doesn't need to load anything from disk, so it would help rule out the possibility of a delay in data loading.
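
If full-size masks turn out to be the bottleneck, here is a minimal config sketch for switching to mini-masks (the attribute names follow this repo's config.py; the class name is illustrative, and older checkouts import with `from config import Config`):

```python
# Sketch only: mini-masks store each instance mask cropped and resized to a
# small fixed shape instead of at full image resolution.
from mrcnn.config import Config  # older versions: from config import Config

class MyConfig(Config):
    NAME = "my_dataset"
    USE_MINI_MASK = True          # keep masks small in memory
    MINI_MASK_SHAPE = (56, 56)    # (height, width) of stored mini-masks
```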

As for why it gets slower with time, I cannot think of any reason why that would happen. The only thing that gets bigger with time that I can think of is the events file. Are you using a standard file system, or something special that might get slower the bigger the files get?

Thanks guys. I think I have fixed the problem.

@jiang1st hey, how did you fix the problem? I think I encountered a similar problem, and it's really odd that training is even slower when using multiple GPUs (compared to a single GPU).

@ppwwyyxx

The best known strategies to scale to multiple GPUs are in the official TF benchmark.

What are the strategies exactly? I can't seem to find them other than some reference parameters.

Code is in https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks. But there aren't many high-level docs. Some are only briefly mentioned here although the content might be a bit outdated.


Hi guys, how did you fix this problem? I ran into the same one: GPU utilization is almost 0, but GPU memory is fully loaded. Thank you in advance.

I have a similar issue:
My own dataset (small... 154 images)... Masks are just the bounding boxes.

GPU utilization is 0-9%, mostly around 1%. I suspect the time goes into opening the images; I know it's loading the mask for each image, because I put a debugging print in the dataset's load_mask.

Memory is at 80-95% of GPU (GTX 1070, so nothing special there)
Using TensorFlow 1.12 GPU, CUDA 9.0, Keras 2.1.6
Batch size is 1 (1 image/GPU because of memory limits on GPU)
CPU Utilization is 100% on 4 cores (all pegged on python).
Training, as a result, is extremely slow...

I've tried wrapping all model building and training in a with tf.device('/gpu:0'): block and setting os.environ["CUDA_VISIBLE_DEVICES"]... But the process still uses the CPU rather than the GPU for processing.
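
One thing worth checking, shown as a quick hedged sketch below: CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow is imported, and tf.test.is_gpu_available() tells you whether TF sees the GPU at all (if it returns False, the CUDA/cuDNN install is the problem rather than this repo):

```python
# Sketch only: set the environment variable before importing TensorFlow,
# then verify that TF can actually see a GPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0

import tensorflow as tf
print(tf.test.is_gpu_available())          # should print True
```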

@eselkin, have you tried using a dictionary to cache the loaded images and masks? This can be done easily by overloading the load_image and load_mask functions. When I did this, the GPU utilisation bumped up significantly.
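
For anyone who wants to try this, here is a rough sketch of such a cache (the subclass and base-class names are placeholders; it assumes the whole dataset fits in RAM, which is fine for a few hundred images):

```python
# Sketch only: memoize decoded images and masks so each is read from disk once.
class CachedDataset(MyDataset):          # MyDataset = your utils.Dataset subclass
    def __init__(self, *args, **kwargs):
        super(CachedDataset, self).__init__(*args, **kwargs)
        self._image_cache = {}
        self._mask_cache = {}

    def load_image(self, image_id):
        if image_id not in self._image_cache:
            self._image_cache[image_id] = super(CachedDataset, self).load_image(image_id)
        return self._image_cache[image_id]

    def load_mask(self, image_id):
        # load_mask returns (masks, class_ids); the tuple is cached as-is
        if image_id not in self._mask_cache:
            self._mask_cache[image_id] = super(CachedDataset, self).load_mask(image_id)
        return self._mask_cache[image_id]
```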

Hi @eselkin
I am facing the same issue. Did you find a solution? It has increased the training time significantly.
