Mask_RCNN: Even when training on the GPU, CPU core usage reaches 100% and the system crashes

Created on 25 Jun 2018 · 5 Comments · Source: matterport/Mask_RCNN

I'm using a Titan X GPU in a Dell Precision tower with 12 cores. I'm running the training on the GPU, and GPU memory usage reaches 11.6 GB, but at the same time all CPU cores reach 100% usage, and the system then crashes and reboots. When I used the same card in another tower, I was able to run the code. Can anyone suggest what the issue might be?

All 5 comments

Maybe IMAGES_PER_GPU should be set to 1 rather than 2.

I faced a similar issue, and it appears that at some point the max_queue_size parameter passed to fit_generator in model.py was set to 100.

This parameter essentially causes Keras to queue up 100 images (which in my case were quite large) in RAM before processing them. It ended up maxing out my RAM and crashing due to an OOM error. Adding swap solved it for me, but obviously made it quite slow.
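To get a feel for why a queue of 100 can exhaust RAM, here is a rough back-of-the-envelope estimate. It is only a sketch under assumptions that are not stated in the thread: each queue slot holds one full batch, images are stored as float32, and the sizes used (1024x1024 RGB, 2 images per batch) are illustrative. Masks and other per-batch arrays would add to the total. The function name is hypothetical.

```python
def estimate_queue_ram_gib(queue_size, images_per_batch, height, width,
                           channels=3, bytes_per_value=4):
    """Rough RAM (in GiB) held by a full generator queue, counting images only.

    Assumes every queue slot holds one batch of float32 images (an assumption,
    not something confirmed in the thread).
    """
    bytes_per_image = height * width * channels * bytes_per_value
    total_bytes = queue_size * images_per_batch * bytes_per_image
    return total_bytes / 1024 ** 3


if __name__ == "__main__":
    # Compare the queue size mentioned above with a smaller, illustrative one.
    for queue_size in (100, 10):
        gib = estimate_queue_ram_gib(queue_size, images_per_batch=2,
                                     height=1024, width=1024)
        print(f"max_queue_size={queue_size}: ~{gib:.2f} GiB of image data")
```

Under these assumptions the estimate scales linearly with both the queue size and the batch size, so lowering either one reduces the RAM held by the queue proportionally.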

Part of this queuing process hits the CPU cores quite hard, and I also saw maximum CPU utilisation while Keras did this. Are you overclocking your CPU by any chance? Perhaps try a tool such as stress: run stress --cpu n_threads, replacing n_threads with your number of CPU threads. Run this for a couple of hours and check that your system is stable.
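If you would rather not count threads by hand, the stress invocation can be assembled from Python. This is a minimal sketch: the helper names are hypothetical, the two-hour default simply mirrors "a couple of hours", and it assumes the stress tool is installed and accepts --cpu and --timeout (in seconds).

```python
import os
import shutil
import subprocess


def build_stress_command(n_threads=None, seconds=7200):
    """Return the argv list for a CPU-only `stress` run.

    n_threads defaults to the number of logical CPUs reported by the OS.
    """
    if n_threads is None:
        n_threads = os.cpu_count() or 1
    return ["stress", "--cpu", str(n_threads), "--timeout", str(seconds)]


def run_stress(n_threads=None, seconds=7200):
    """Run `stress` if it is on PATH; return its exit code, or None if missing."""
    if shutil.which("stress") is None:
        return None
    return subprocess.run(build_stress_command(n_threads, seconds)).returncode


if __name__ == "__main__":
    # Only print the command here; call run_stress() to actually load the CPU.
    print(" ".join(build_stress_command()))
```

If the machine reboots during such a run with no training involved, that points at hardware, cooling, or power rather than at Mask_RCNN itself.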

So in summary, check your system is stable with stress, and remove any overclocks. Reduce the max_queue_size and see if that helps.

@JoeLogan1981 Sorry, the problem still exists... I don't know how to deal with it.

@endluo Try setting use_multiprocessing=False in the fit_generator call in model.py and see if that fixes the issue. Crashing and rebooting sounds like a system stability issue to me, so try running some stress tests as described above. If the issue still persists, send over some more information, such as your TF version, Keras version, and CUDA/cuDNN/NVIDIA driver versions, and make sure you are running the latest Mask_RCNN code. Good luck!
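For reference, the Keras keyword is use_multiprocessing, and it sits alongside workers and max_queue_size in the fit_generator call. The sketch below collects a conservative set of these arguments. The helper name and the default values (10 and 1) are illustrative assumptions rather than settings recommended in this thread, and the exact call site in model.py may differ between versions.

```python
def conservative_generator_kwargs(max_queue_size=10, workers=1):
    """Keyword arguments for Keras `fit_generator` that limit CPU and RAM load.

    The defaults are illustrative guesses; tune them for your machine.
    """
    return {
        "max_queue_size": max_queue_size,   # batches buffered in RAM
        "workers": workers,                 # parallel generator workers
        "use_multiprocessing": False,       # threads instead of processes
    }


if __name__ == "__main__":
    # Hypothetical usage at the training call in model.py:
    #   self.keras_model.fit_generator(train_generator, ...,
    #                                  **conservative_generator_kwargs())
    print(conservative_generator_kwargs())
```

Starting from the conservative values and raising them one at a time makes it easier to see which setting triggers the instability.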

@JoeLogan1981 It works! Thanks.
