I encounter a strange issue: when I increase the GPU count from 1 to 2 to 3 to 4 (my max GPU count), the training time per epoch significantly increases, even when I select the GPUs via CUDA_VISIBLE_DEVICES with the GPU ids. All GPUs are activated and utilized during training (so they are active).
The GPUs have 12 GB of memory each. I tried both 2 images per GPU (BATCH_SIZE=8) and 1 image per GPU (BATCH_SIZE=4). Strangely, the fastest training configuration is now GPU_COUNT=1 and IMAGES_PER_GPU=1 (?).
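For reference, a minimal sketch of the kind of config I'm using (the class name and GPU ids are illustrative; the repo's Config derives BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU):

```python
import os
# Select the GPUs before TensorFlow initializes (ids are illustrative).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

from mrcnn.config import Config

class CocoTrainConfig(Config):  # hypothetical name
    NAME = "coco"
    GPU_COUNT = 4           # varied from 1 to 4
    IMAGES_PER_GPU = 2      # also tried 1
    STEPS_PER_EPOCH = 1000  # steps ("cycles") per epoch
```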
Roughly, it takes 25 minutes per epoch (COCO dataset training with 1000 steps per epoch) with 1 GPU, 40 minutes with 2 GPUs, 50 minutes with 3 GPUs, and 1h10min with 4 GPUs.
Is it some issue in the parallel_model.py code (maybe in combination with the Windows OS)? There is a clear trend: the higher the GPU_COUNT, the longer the training time.
The specifications:
I tried searching the issue list, but I cannot find a similar issue. I don't understand the problem :-(
Similar to #589 (see multi-GPU comment).
Thanks for your response. Nevertheless, it doesn't work. I copied both your version of model.py and parallel_model.py into my mrcnn folder, but the issue remains :-(
Update: even when I create an Anaconda env with all the libraries mentioned in #710, it still doesn't work.
I came here to mention @ericj974's post; I will try the changes myself and see how it goes.
@schmidje, were you able to reproduce the error?
I just installed this repository on Linux (Ubuntu 16.04) and somehow multi-GPU training is working properly now... Maybe some issue with the Windows 10 OS?
Remaining specs (on Ubuntu 16.04):
GPU driver: 384.130
tensorflow-gpu: 1.8
Keras: 2.1.6
I did not yet find the time to test the changes proposed by @ericj974. In your case, did a fresh checkout of the repo fix the issue, or did you apply the proposed changes?
I am on Linux (RHEL), BTW.
thx
I have a dual-boot computer with Ubuntu 16.04 and Windows 10. Switching to Ubuntu 16.04 solved the issue, so based on that I think there might be some problem in the Windows-related code/integration :-(
I think we may have found the explanation in #875.
@pieterbl86 A few points that might shed some light on the topic.
Following Keras convention, an epoch doesn't always mean a full pass through the dataset. Rather, the STEPS_PER_EPOCH config setting allows you to control the number of steps per epoch. You can use small epochs to get more frequent updates in TensorBoard, or you can set it such that it corresponds with a full pass through the dataset.
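For example, to make an epoch correspond to one full pass (a sketch; `dataset` stands for a prepared mrcnn Dataset instance, and BATCH_SIZE is the product defined next):

```python
# One full pass through the dataset per "epoch":
config.STEPS_PER_EPOCH = len(dataset.image_ids) // config.BATCH_SIZE
```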
BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU.
Therefore, images per epoch = STEPS_PER_EPOCH * IMAGES_PER_GPU * GPU_COUNT.
When you increase the GPU_COUNT, you're also increasing the number of images you're training on per epoch. So it's normal for training an epoch to take longer. You're effectively training a larger epoch.
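Plugging in the STEPS_PER_EPOCH=1000 reported above makes this concrete:

```python
STEPS_PER_EPOCH = 1000   # as reported above
IMAGES_PER_GPU = 1

for gpu_count in (1, 2, 3, 4):
    images_per_epoch = STEPS_PER_EPOCH * IMAGES_PER_GPU * gpu_count
    print(gpu_count, "GPU(s):", images_per_epoch, "images per epoch")

# 1 GPU(s): 1000 images per epoch
# 2 GPU(s): 2000 images per epoch
# 3 GPU(s): 3000 images per epoch
# 4 GPU(s): 4000 images per epoch  -> 4x the work per "epoch"
```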
On Windows, the effect is more obvious. Keras has a feature that allows loading data in parallel on multiple CPU threads, but this doesn't work on Windows due to the way Python threads are implemented. So, most likely, the bottleneck in Windows is in data loading.
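To illustrate (a sketch, not the repo's exact model.py code; `keras_model`, `train_generator`, `config`, and `epochs` are assumed names, while `workers` and `use_multiprocessing` are real parameters of `fit_generator` in Keras 2.1.x):

```python
import os
import multiprocessing

# Keras can prefetch batches from the data generator in parallel workers,
# but that mechanism doesn't work reliably on Windows, so data loading
# effectively falls back to the main thread there.
if os.name == "nt":  # Windows
    workers, use_multiprocessing = 0, False
else:
    workers, use_multiprocessing = multiprocessing.cpu_count(), True

keras_model.fit_generator(
    train_generator,                          # assumed data generator
    steps_per_epoch=config.STEPS_PER_EPOCH,
    epochs=epochs,
    workers=workers,
    use_multiprocessing=use_multiprocessing,
)
```

With `workers=0`, every batch is generated synchronously between GPU steps, which would explain why adding GPUs makes each (larger) epoch disproportionately slower on Windows.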