Mask_RCNN: Different training schedules for limited GPU

Created on 8 Mar 2018 · 8 Comments · Source: matterport/Mask_RCNN

Hi

I have a binary dataset that I'm training with ResNet50 and MobileNet backbones.
My graphics card is an NVIDIA GeForce GTX 970 (3.5 + 0.5 GB).

When I tried to train with ResNet101, I got the error

_OOM when allocating tensor with shape [200,28,28,256]_

This is probably because the GPU memory is not sufficient to hold the entire model.
When I train with the "all" layers setting in the training schedule for ResNet50 and MobileNet, I get the same error.

So I have experimented with training schedules, and training runs successfully with the following:
resnet50 - 3+ layers
mobilenet - 1+ layers

I intend to train using the following schedule (see the sketch below):
step 1: epochs 1-40 - train heads
step 2: epochs 40-120 - train 4+
step 3: epochs 120-160 - train layers 1 to 4
step 4: epochs 160-200 - train 4+
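
For reference, here is a minimal sketch of how such a schedule maps onto repeated `model.train()` calls in this repo. It assumes `model`, `dataset_train`, `dataset_val`, and `config` are already set up as in the sample scripts (e.g. coco.py). Note that step 3 ("layers 1 to 4") has no built-in shortcut, so `'3+'` is used below as an approximation; matching backbone stages 1-4 exactly would need a custom regex for the `layers` argument.

```python
# Sketch: the proposed schedule as cumulative model.train() calls.
# `epochs` is the cumulative target epoch, not a count of additional epochs.

# Step 1: epochs 1-40, heads only
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40, layers='heads')

# Step 2: epochs 40-120, ResNet stage 4 and up
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=120, layers='4+')

# Step 3: epochs 120-160, deeper fine-tuning
# (approximated with '3+'; "layers 1 to 4" would need a custom layer regex)
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160, layers='3+')

# Step 4: epochs 160-200, back to stage 4 and up
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=200, layers='4+')
```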

Has anyone experimented with training schedules? How does replacing the usual final step of fine-tuning "all" layers with steps 3 and 4 above affect the loss?

Most helpful comment

@samhsieh
I'd recommend saving temporary weights, restarting your Python kernel to free memory, and then loading the weights back and continuing training from where you stopped. Not the best solution, but it still works!

All 8 comments

Hi
I have a similar problem: my graphics card is an NVIDIA GeForce GTX 1080 (8 GB).
When training stage 3 (epoch 122/160 - train all layers), I get the error message:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,256,256,256].
Even after changing TRAIN_ROIS_PER_IMAGE in the config (200 -> 32), I still get the same failure.
Could you suggest how to solve the OOM issue? Thanks.

@samhsieh I ended up only training until epoch 120.
But since you have an 8 GB card, I think reducing the batch size may let you train all layers. I am not sure whether reducing TRAIN_ROIS_PER_IMAGE helps in this case. Other people have mentioned that they could train all layers on an 8 GB GPU; check out those issues. A sketch of the relevant config changes is below.
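
For illustration, a minimal sketch of a memory-reducing config override, assuming the standard `mrcnn.config.Config` class from this repo; the class name `LowMemoryConfig` and the particular values are only examples.

```python
# Minimal sketch of a config override to reduce GPU memory use.
# The attribute names below exist in mrcnn/config.py; the values are examples.
from mrcnn.config import Config

class LowMemoryConfig(Config):
    NAME = "low_memory"            # hypothetical experiment name
    NUM_CLASSES = 1 + 1            # background + 1 foreground class
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1             # effective batch size = GPU_COUNT * IMAGES_PER_GPU
    TRAIN_ROIS_PER_IMAGE = 100     # default is 200
    IMAGE_MIN_DIM = 512            # smaller input images also cut memory
    IMAGE_MAX_DIM = 512

config = LowMemoryConfig()
config.display()
```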

@samhsieh
I'd recommend saving temporary weights, restarting your Python kernel to free memory, and then loading the weights back and continuing training from where you stopped. Not the best solution, but it still works!
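
As an illustration, a rough sketch of that workflow using this repo's model API; the paths are placeholders, and `config`, `MODEL_DIR`, and the datasets are assumed to exist already.

```python
# Before restarting the kernel: save the current weights to disk.
model.keras_model.save_weights("/path/to/temp_weights.h5")

# After restarting the kernel: rebuild the model and load the weights back.
import mrcnn.model as modellib
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)
model.load_weights("/path/to/temp_weights.h5", by_name=True)

# Resume training; `epochs` is cumulative, so pass the final epoch of the
# next stage (e.g. continue from epoch 120 towards 160).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160, layers='all')
```

Training also writes per-epoch checkpoints under `model_dir`, so the last auto-saved checkpoint can be loaded instead of a manually saved file.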

Hi @Paulito-7
Did you experiment with the training schedules? Do you know if there is a significant change in the loss if I skip the "train all layers" step and instead train layers 1-3 for some epochs and then layers 3 to the heads for the next epochs?

Hi @gsujansai @Paulito-7

Thanks for your suggestions on how to handle the OOM issue.
It kept training a new model after I modified the schedule to end at epoch 120.
BTW, I also upgraded the framework packages, including tensorflow_gpu from v1.4 to v1.5,
CUDA 8 to CUDA 9, and so on.

@gsujansai
Sorry for the late reply; I intend to try it this week. I'll keep you updated!

Hello @samhsieh
I have been working on this process. It seems that you lose the optimizer/gradient state if you save only the weights and run another instance of your program, so you have to save the whole model in order not to lose it, as https://github.com/matterport/Mask_RCNN/issues/308 mentions.
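
For illustration, a short sketch of the difference, assuming the underlying Keras model is exposed as `model.keras_model`; the paths are placeholders, and whether the full save works cleanly depends on the Keras version and the custom layers.

```python
# Saving only the weights drops the optimizer state:
model.keras_model.save_weights("/path/to/weights_only.h5")

# Saving the full Keras model keeps architecture, weights, and optimizer state,
# so training can resume without losing it. Reloading with
# keras.models.load_model() requires passing the custom Mask R-CNN layers via
# `custom_objects`, which is part of what issue #308 discusses.
model.keras_model.save("/path/to/full_model.h5")
```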

Hello @samhsieh
I have started retraining Mask R-CNN using the default ResNet101 backbone provided here, retraining from an existing checkpoint. I only have an NVIDIA GTX 960 with 2 GB of memory. During the first stage of the schedule, up to 3 epochs, I didn't run into any errors at all, even with a batch size of 2. Please let me know at what stage I should expect to run into Out of Memory (OOM) errors.
