Mask_rcnn: Training from the sample file uses CPU instead of GPU

Created on 6 Oct 2018 · 7 Comments · Source: matterport/Mask_RCNN

I've already followed the installation steps here, and have CUDA and cuDNN installed. However, when I try to train with one of the sample files, it seems to be using the CPU rather than the GPU:

edmond@edmond-OptiPlex-3020:~/Desktop/Mask_RCNN/samples/balloon$ python balloon.py train --dataset=../../datasets/balloon --weights=coco
/home/edmond/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Weights:  coco
Dataset:  ../../datasets/balloon
Logs:  /home/edmond/Desktop/Mask_RCNN/logs

Configurations:
BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     2
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE         None
DETECTION_MAX_INSTANCES        100
DETECTION_MIN_CONFIDENCE       0.9
DETECTION_NMS_THRESHOLD        0.3
FPN_CLASSIF_FC_LAYERS_SIZE     1024
GPU_COUNT                      1
GRADIENT_CLIP_NORM             5.0
IMAGES_PER_GPU                 2
IMAGE_CHANNEL_COUNT            3
IMAGE_MAX_DIM                  1024
IMAGE_META_SIZE                14
IMAGE_MIN_DIM                  800
IMAGE_MIN_SCALE                0
IMAGE_RESIZE_MODE              square
IMAGE_SHAPE                    [1024 1024    3]
LEARNING_MOMENTUM              0.9
LEARNING_RATE                  0.001
LOSS_WEIGHTS                   {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE                 14
MASK_SHAPE                     [28, 28]
MAX_GT_INSTANCES               100
MEAN_PIXEL                     [123.7 116.8 103.9]
MINI_MASK_SHAPE                (56, 56)
NAME                           balloon
NUM_CLASSES                    2
POOL_SIZE                      7
POST_NMS_ROIS_INFERENCE        1000
POST_NMS_ROIS_TRAINING         2000
PRE_NMS_LIMIT                  6000
ROI_POSITIVE_RATIO             0.33
RPN_ANCHOR_RATIOS              [0.5, 1, 2]
RPN_ANCHOR_SCALES              (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE              1
RPN_BBOX_STD_DEV               [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD              0.7
RPN_TRAIN_ANCHORS_PER_IMAGE    256
STEPS_PER_EPOCH                100
TOP_DOWN_PYRAMID_SIZE          256
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           200
USE_MINI_MASK                  True
USE_RPN_ROIS                   True
VALIDATION_STEPS               50
WEIGHT_DECAY                   0.0001


Loading weights  /home/edmond/Desktop/Mask_RCNN/mask_rcnn_coco.h5
2018-10-05 19:16:32.287563: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Training network heads

Starting at epoch 0. LR=0.001

Checkpoint Path: /home/edmond/Desktop/Mask_RCNN/logs/balloon20181005T1916/mask_rcnn_balloon_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5               (Conv2D)
fpn_c4p4               (Conv2D)
fpn_c3p3               (Conv2D)
fpn_c2p2               (Conv2D)
fpn_p5                 (Conv2D)
fpn_p2                 (Conv2D)
fpn_p3                 (Conv2D)
fpn_p4                 (Conv2D)
In model:  rpn_model
    rpn_conv_shared        (Conv2D)
    rpn_class_raw          (Conv2D)
    rpn_bbox_pred          (Conv2D)
mrcnn_mask_conv1       (TimeDistributed)
mrcnn_mask_bn1         (TimeDistributed)
mrcnn_mask_conv2       (TimeDistributed)
mrcnn_mask_bn2         (TimeDistributed)
mrcnn_class_conv1      (TimeDistributed)
mrcnn_class_bn1        (TimeDistributed)
mrcnn_mask_conv3       (TimeDistributed)
mrcnn_mask_bn3         (TimeDistributed)
mrcnn_class_conv2      (TimeDistributed)
mrcnn_class_bn2        (TimeDistributed)
mrcnn_mask_conv4       (TimeDistributed)
mrcnn_mask_bn4         (TimeDistributed)
mrcnn_bbox_fc          (TimeDistributed)
mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)
/home/edmond/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/home/edmond/anaconda3/lib/python3.6/site-packages/keras/engine/training_generator.py:47: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'
Epoch 1/30

The program gets stuck for about a minute after the last line.

While the training is running, the GPU usage doesn't change at all:

edmond@edmond-OptiPlex-3020:~/Desktop/Mask_RCNN/samples/balloon$ nvidia-smi
Fri Oct  5 19:18:59 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960     Off  | 00000000:01:00.0  On |                  N/A |
| 22%   48C    P5    17W / 130W |    501MiB /  4035MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1104      G   /usr/lib/xorg/Xorg                            26MiB |
|    0      1141      G   /usr/bin/gnome-shell                          49MiB |
|    0      1399      G   /usr/lib/xorg/Xorg                           219MiB |
|    0      1517      G   /usr/bin/gnome-shell                         123MiB |
|    0      2158      G   ...uest-channel-token=12487758558754920652    59MiB |
+-----------------------------------------------------------------------------+

However, it tries to use as much CPU as possible. A screenshot of the htop monitor taken while it was running was linked here.

None of the files related to this training has been altered from the current version of the repo.

All 7 comments

I am having the same problem. I don't know how to use the GPU for training.

I know what is causing your problem.

@acv-anvt Do you have any solutions for it, then?

@waleedka Do you have any suggestions for troubleshooting?

If you run 'pip install -r requirements.txt', you will install the CPU-only build of TensorFlow. Edit requirements.txt and replace tensorflow>=1.3.0 with tensorflow-gpu>=1.3.0, then reinstall.
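To make the edit above less error-prone, here is a minimal sketch of the substitution in Python. The helper name `use_gpu_tensorflow` and the sample list are made up for illustration; only the `tensorflow>=1.3.0` line comes from this thread.

```python
import re

def use_gpu_tensorflow(requirement_lines):
    """Rename a bare `tensorflow` requirement to `tensorflow-gpu`."""
    # Match the package name only when it is exactly "tensorflow", followed
    # by a version specifier or the end of the line, so that an existing
    # "tensorflow-gpu" entry is left untouched.
    pattern = re.compile(r"^(\s*)tensorflow(?=\s*(?:[<>=!~]|$))")
    return [pattern.sub(r"\1tensorflow-gpu", line) for line in requirement_lines]

# Illustrative input; "some-other-package" is a placeholder, not a real entry.
sample = ["some-other-package", "tensorflow>=1.3.0"]
updated = use_gpu_tensorflow(sample)
print(updated)
```

After changing the file, the CPU-only package probably needs to be uninstalled before reinstalling the requirements, since pip will not remove it on its own.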

@hj3yoo Maybe you are missing the configuration that sets the visible CUDA device:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = "0"         # expose only GPU 0 to this process

and

import tensorflow as tf

# Place the training ops on the first GPU
with tf.device('/device:GPU:0'):
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30,
                layers='heads')
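Before changing the training script, it may be worth checking whether TensorFlow can see a GPU at all. The sketch below is one way to do that, assuming `tensorflow.python.client.device_lib` is available in your install. The helper name `list_gpu_devices` is made up for illustration. The environment variables are set before the import as the safe choice, so they are in place before CUDA is initialized.

```python
import os

# Set these before TensorFlow is imported so they are in place
# by the time CUDA devices are discovered.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def list_gpu_devices():
    """Return the GPU device names TensorFlow reports, or None if
    TensorFlow cannot be imported in this environment."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]

gpus = list_gpu_devices()
if gpus is None:
    print("TensorFlow is not importable in this environment")
elif not gpus:
    print("No GPU visible to TensorFlow; training would run on the CPU")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

If the list comes back empty even though nvidia-smi shows the card, the cause is more likely the installed TensorFlow build or the CUDA/cuDNN versions than the device placement in the script.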

After some headaches with CUDA compatibility and such, I've managed to start the training :D

The first epoch was successful, so let's hope everything goes well.

I'll close the issue once the training is complete (probably within a day).

I was having issues even with tensorflow-gpu installed.
I deleted the environment, created a new one, and then proceeded to install the dependencies as suggested by @hoangcuongbk80.

Since I work on a deepstation with no sudo rights, I am limited to the drivers that are already installed.
So I also pinned the TensorFlow version:
tensorflow-gpu==1.8.0
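After recreating an environment like this, a quick check of which TensorFlow distributions actually ended up installed can save some guessing. This is only a sketch: the helper name `installed_version` is made up, and it falls back to `pkg_resources` on Python versions older than 3.8, where `importlib.metadata` is not in the standard library.

```python
def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib.metadata import version, PackageNotFoundError
    except ImportError:
        # Python < 3.8: fall back to setuptools' pkg_resources.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Report both package names discussed in this thread.
for name in ("tensorflow", "tensorflow-gpu"):
    print(name, "->", installed_version(name) or "not installed")
```

In a clean environment set up as described, you would expect only tensorflow-gpu to be reported, at the pinned version.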
