Mask_rcnn: Cannot increase number of GPUs on gcloud VM installation

Created on 27 Sep 2018 · 3 comments · Source: matterport/Mask_RCNN

Hi,
I am trying to run Mask R-CNN on Google Cloud. I created a VM with 2 GPUs (Tesla K80).

I have installed:

# Name                    Version                   Build  Channel
python                    3.6.6
Keras                     2.2.2                     <pip>
Keras-Applications        1.0.4                     <pip>
Keras-Preprocessing       1.0.2                     <pip>
tensorflow-gpu            1.10.1                    <pip>
mask-rcnn                 2.1

When I set GPU_COUNT = 2 in the configuration and try to initialise the model:

# Create model in training mode
model = modellib.MaskRCNN(mode="training", config=config,
                          model_dir=MODEL_DIR)

I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/keras/engine/network.py in __setattr__(self, name, value)
    312             try:
--> 313                 is_graph_network = self._is_graph_network
    314             except AttributeError:

~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/parallel_model.py in __getattribute__(self, attrname)
     45             return getattr(self.inner_model, attrname)
---> 46         return super(ParallelModel, self).__getattribute__(attrname)
     47 

AttributeError: 'ParallelModel' object has no attribute '_is_graph_network'

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-16-7928c4edfc77> in <module>()
      1 # Create model in training mode
      2 model = modellib.MaskRCNN(mode="training", config=config,
----> 3                           model_dir=MODEL_DIR)

~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/model.py in __init__(self, mode, config, model_dir)
   1843         self.model_dir = model_dir
   1844         self.set_log_dir()
-> 1845         self.keras_model = self.build(mode=mode, config=config)
   1846 
   1847     def build(self, mode, config):

~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/model.py in build(self, mode, config)
   2068         if config.GPU_COUNT > 1:
   2069             from mrcnn.parallel_model import ParallelModel
-> 2070             model = ParallelModel(model, config.GPU_COUNT)
   2071 
   2072         return model

~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/parallel_model.py in __init__(self, keras_model, gpu_count)
     33         gpu_count: Number of GPUs. Must be > 1
     34         """
---> 35         self.inner_model = keras_model
     36         self.gpu_count = gpu_count
     37         merged_outputs = self.make_parallel()

~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/keras/engine/network.py in __setattr__(self, name, value)
    314             except AttributeError:
    315                 raise RuntimeError(
--> 316                     'It looks like you are subclassing `Model` and you '
    317                     'forgot to call `super(YourClass, self).__init__()`.'
    318                     ' Always start with this line.')

RuntimeError: It looks like you are subclassing `Model` and you forgot to call `super(YourClass, self).__init__()`. Always start with this line.

Training does run on a single GPU, but after a while it crashes due to memory issues.

Any idea?
Thanks for the help!

NB: The output of

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

is

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11110617905889698381
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 7771084451487170049
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 15069942265122474649
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11281429300
locality {
  bus_id: 1
  links {
    link {
      device_id: 1
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 7523308974350294539
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 11281553818
locality {
  bus_id: 1
  links {
    link {
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 2569808516305462550
physical_device_desc: "device: 1, name: Tesla K80, pci bus id: 0000:00:05.0, compute capability: 3.7"
]


All 3 comments

Hi,
in order to use multiple GPUs I needed to downgrade Keras from 2.2.2 to 2.1.3.
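Assuming a pip-based environment like the one listed above, the downgrade described in this comment would amount to:

```shell
# Replace Keras 2.2.2 with 2.1.3, a version reported to work with ParallelModel
pip install keras==2.1.3
```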

I didn't need to downgrade Keras. Simply following the suggestion in the error message was sufficient, i.e., add the line super(ParallelModel, self).__init__() to parallel_model.py, directly after the docstring of def __init__(self, keras_model, gpu_count): (line 30).
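To see why this one-line fix works, here is a minimal, self-contained sketch of the failure mode (the Network and ParallelModel bodies below are simplified stand-ins, not the actual Keras or Mask_RCNN code): the base class's __setattr__ reads a flag that only its __init__ sets, so any attribute assignment in a subclass before super().__init__() raises the RuntimeError seen in the traceback.

```python
class Network:
    """Simplified stand-in for keras.engine.network.Network."""
    def __init__(self):
        # Keras sets this flag in __init__; __setattr__ below relies on it.
        object.__setattr__(self, "_is_graph_network", False)

    def __setattr__(self, name, value):
        try:
            is_graph_network = self._is_graph_network
        except AttributeError:
            raise RuntimeError(
                "It looks like you are subclassing `Model` and you forgot to "
                "call `super(YourClass, self).__init__()`. "
                "Always start with this line.")
        object.__setattr__(self, name, value)


class BrokenParallelModel(Network):
    def __init__(self, inner_model):
        # Bug: assigning an attribute before super().__init__() triggers
        # __setattr__ while _is_graph_network does not exist yet.
        self.inner_model = inner_model


class FixedParallelModel(Network):
    def __init__(self, inner_model):
        super().__init__()  # the one-line fix suggested in the error message
        self.inner_model = inner_model


try:
    BrokenParallelModel("model")
except RuntimeError as e:
    print("broken:", e)

m = FixedParallelModel("model")
print("fixed, inner_model =", m.inner_model)
```

The real patch in parallel_model.py is the same idea: call super().__init__() before the first self.xxx = ... assignment in ParallelModel.__init__.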

@simone-codeluppi, @florian-koenig can you please share the inference speed, in seconds or milliseconds per picture, on your Google Cloud machine?

In my case, the inference speed (n1-standard-4 Google Cloud VM with a Tesla P4 GPU) is ~5.4 s per 1024 x 1024 picture on a tensorflow-gpu 1.12 setup (see details here https://github.com/matterport/Mask_RCNN/issues/1270).

Also, I did profiling on a V100 GPU with 16 GB memory, and the inference speed is slightly better, ~3 s per 1024 x 1024 picture, but still far behind the 200-300 ms benchmark mentioned in this repository. For reference, the V100 machine has the following setup: tensorflow-gpu==1.8.0, keras==2.1.5, nvcc V9.0.176, cudnn 7.4.2, python 3.6.8.
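For comparable numbers, per-image latency could be measured as below. This is a generic sketch: the detect function is a placeholder simulating a model call (in Mask_RCNN it would be model.detect([image])), and the warmup pass discards the first call, which typically includes graph and kernel setup.

```python
import time

def detect(image):
    # Placeholder standing in for model.detect([image]); sleeps to simulate work.
    time.sleep(0.01)
    return {"rois": [], "masks": []}

def time_inference(fn, image, warmup=1, runs=5):
    """Average wall-clock seconds per call, after discarding warmup calls."""
    for _ in range(warmup):
        fn(image)
    start = time.perf_counter()
    for _ in range(runs):
        fn(image)
    return (time.perf_counter() - start) / runs

avg = time_inference(detect, image=None)
print(f"avg inference time: {avg * 1000:.1f} ms / image")
```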

