Mask_RCNN: Different training schedules for limited GPU

Created on 8 Mar 2018 · 8 Comments · Source: matterport/Mask_RCNN

Hi

I have a binary dataset that I'm training with ResNet50 and MobileNet backbones.
My graphics card is an NVIDIA GeForce GTX 970 (3.5 + 0.5 GB).

When I tried to train with ResNet101, I got the error

_OOM when allocating tensor with shape [200,28,28,256]_

This is probably because the GPU memory is not sufficient to hold the entire model.
When I train with the "all" layers setting in the training schedule for ResNet50 and MobileNet, I get the same error.

So I have experimented with training schedules, and training runs successfully with the following:
resnet50 - 3+ layers
mobilenet - 1+ layers

I intend to train using the following schedule (see the sketch below):
step 1: epochs 1-40 - train heads
step 2: epochs 40-120 - train 4+
step 3: epochs 120-160 - train layers 1 to 4
step 4: epochs 160-200 - train 4+
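
For reference, here is a minimal sketch of how such a schedule maps onto repeated `model.train()` calls in this repo. It assumes `model`, `dataset_train`, `dataset_val`, and `config` are already set up as in the sample scripts (e.g. coco.py). Note that step 3 ("layers 1 to 4") has no built-in shortcut, so `'3+'` is used below as an approximation; matching backbone stages 1-4 exactly would need a custom regex for the `layers` argument.

```python
# Sketch: the proposed schedule as cumulative model.train() calls.
# `epochs` is the cumulative target epoch, not a count of additional epochs.

# Step 1: epochs 1-40, heads only
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40, layers='heads')

# Step 2: epochs 40-120, ResNet stage 4 and up
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=120, layers='4+')

# Step 3: epochs 120-160, deeper fine-tuning
# (approximated with '3+'; "layers 1 to 4" would need a custom layer regex)
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160, layers='3+')

# Step 4: epochs 160-200, back to stage 4 and up
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=200, layers='4+')
```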

Has anyone experimented with training schedules? How does replacing the usual final step of fine-tuning "all" layers with steps 3 and 4 above affect the loss?

Most helpful comment

@samhsieh
I'd recommend saving temporary weights, restarting your Python kernel to free memory, and then loading the weights back and continuing training from where you stopped. Not the best solution, but it still works!

All 8 comments

Hi
I have a similar problem: my graphics card is an NVIDIA GeForce GTX 1080 (8 GB).
When training stage 3 (epoch 122/160 - train all layers), I get the error message:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,256,256,256].
Even after changing TRAIN_ROIS_PER_IMAGE in the config (200 -> 32), I still get the same failure.
Could you suggest how to solve the OOM issue? Thanks.

@samhsieh I ended up only training until epoch 120.
But since you have an 8 GB card, I think reducing the batch size may let you train all layers. I am not sure whether reducing TRAIN_ROIS_PER_IMAGE helps in this case. Other people have mentioned that they could train all layers on an 8 GB GPU; check out those issues. A sketch of the relevant config changes is below.
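
For illustration, a minimal sketch of a memory-reducing config override, assuming the standard `mrcnn.config.Config` class from this repo; the class name `LowMemoryConfig` and the particular values are only examples.

```python
# Minimal sketch of a config override to reduce GPU memory use.
# The attribute names below exist in mrcnn/config.py; the values are examples.
from mrcnn.config import Config

class LowMemoryConfig(Config):
    NAME = "low_memory"            # hypothetical experiment name
    NUM_CLASSES = 1 + 1            # background + 1 foreground class
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1             # effective batch size = GPU_COUNT * IMAGES_PER_GPU
    TRAIN_ROIS_PER_IMAGE = 100     # default is 200
    IMAGE_MIN_DIM = 512            # smaller input images also cut memory
    IMAGE_MAX_DIM = 512

config = LowMemoryConfig()
config.display()
```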

@samhsieh
I'd recommend saving temporary weights, restarting your Python kernel to free memory, and then loading the weights back and continuing training from where you stopped. Not the best solution, but it still works!
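
As an illustration, a rough sketch of that workflow using this repo's model API; the paths are placeholders, and `config`, `MODEL_DIR`, and the datasets are assumed to exist already.

```python
# Before restarting the kernel: save the current weights to disk.
model.keras_model.save_weights("/path/to/temp_weights.h5")

# After restarting the kernel: rebuild the model and load the weights back.
import mrcnn.model as modellib
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)
model.load_weights("/path/to/temp_weights.h5", by_name=True)

# Resume training; `epochs` is cumulative, so pass the final epoch of the
# next stage (e.g. continue from epoch 120 towards 160).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160, layers='all')
```

Training also writes per-epoch checkpoints under `model_dir`, so the last auto-saved checkpoint can be loaded instead of a manually saved file.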

Hi @Paulito-7
Did you experiment with the training schedules? Do you know if there is a significant change in the loss if I skip the "train all layers" step and instead train layers 1-3 for some epochs and then layers 3 to the heads for the next epochs?

Hi @gsujansai @Paulito-7

Thanks for your suggestions on how to handle the OOM issue.
It kept training a new model after I modified the schedule to end at epoch 120.
BTW, I also upgraded the framework packages, including tensorflow_gpu from v1.4 to v1.5,
CUDA 8 to CUDA 9, and so on.

@gsujansai
Sorry for the late reply; I intend to try it this week. I'll keep you updated!

Hello @samhsieh
I have been working on this process. It seems that you lose the optimizer/gradient state if you save only the weights and run another instance of your program, so you have to save the whole model in order not to lose it, as https://github.com/matterport/Mask_RCNN/issues/308 mentions.
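
For illustration, a short sketch of the difference, assuming the underlying Keras model is exposed as `model.keras_model`; the paths are placeholders, and whether the full save works cleanly depends on the Keras version and the custom layers.

```python
# Saving only the weights drops the optimizer state:
model.keras_model.save_weights("/path/to/weights_only.h5")

# Saving the full Keras model keeps architecture, weights, and optimizer state,
# so training can resume without losing it. Reloading with
# keras.models.load_model() requires passing the custom Mask R-CNN layers via
# `custom_objects`, which is part of what issue #308 discusses.
model.keras_model.save("/path/to/full_model.h5")
```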

Hello @samhsieh
I have started retraining Mask R-CNN using the default ResNet101 backbone provided here, retraining from an existing checkpoint. I only have an NVIDIA GTX 960 with 2 GB of memory. During the first stage of the schedule, up to 3 epochs, I didn't run into any errors at all, even with a batch size of 2. Please let me know at what stage I should expect to run into Out of Memory (OOM) errors.
