Hi,
I am trying to run Mask R-CNN on Google Cloud. I created a VM with 2 GPUs (Tesla K80).
I have installed:
# Name                Version   Build   Channel
python                3.6.6
Keras                 2.2.2             <pip>
Keras-Applications    1.0.4             <pip>
Keras-Preprocessing   1.0.2             <pip>
tensorflow-gpu        1.10.1            <pip>
mask-rcnn             2.1
When I set `GPU_COUNT = 2` in the configuration and try to initialise the model:
# Create model in training mode
model = modellib.MaskRCNN(mode="training", config=config,
                          model_dir=MODEL_DIR)
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/keras/engine/network.py in __setattr__(self, name, value)
312 try:
--> 313 is_graph_network = self._is_graph_network
314 except AttributeError:
~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/parallel_model.py in __getattribute__(self, attrname)
45 return getattr(self.inner_model, attrname)
---> 46 return super(ParallelModel, self).__getattribute__(attrname)
47
AttributeError: 'ParallelModel' object has no attribute '_is_graph_network'
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-16-7928c4edfc77> in <module>()
1 # Create model in training mode
2 model = modellib.MaskRCNN(mode="training", config=config,
----> 3 model_dir=MODEL_DIR)
~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/model.py in __init__(self, mode, config, model_dir)
1843 self.model_dir = model_dir
1844 self.set_log_dir()
-> 1845 self.keras_model = self.build(mode=mode, config=config)
1846
1847 def build(self, mode, config):
~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/model.py in build(self, mode, config)
2068 if config.GPU_COUNT > 1:
2069 from mrcnn.parallel_model import ParallelModel
-> 2070 model = ParallelModel(model, config.GPU_COUNT)
2071
2072 return model
~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/parallel_model.py in __init__(self, keras_model, gpu_count)
33 gpu_count: Number of GPUs. Must be > 1
34 """
---> 35 self.inner_model = keras_model
36 self.gpu_count = gpu_count
37 merged_outputs = self.make_parallel()
~/anaconda3/envs/mask-rcnn_env/lib/python3.6/site-packages/keras/engine/network.py in __setattr__(self, name, value)
314 except AttributeError:
315 raise RuntimeError(
--> 316 'It looks like you are subclassing `Model` and you '
317 'forgot to call `super(YourClass, self).__init__()`.'
318 ' Always start with this line.')
RuntimeError: It looks like you are subclassing `Model` and you forgot to call `super(YourClass, self).__init__()`. Always start with this line.
The model does run on a single GPU, but after a while training crashes due to memory issues.
Any idea?
Thanks for the help!
NB: The output of
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
is
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11110617905889698381
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 7771084451487170049
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 15069942265122474649
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11281429300
locality {
bus_id: 1
links {
link {
device_id: 1
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 7523308974350294539
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 11281553818
locality {
bus_id: 1
links {
link {
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 2569808516305462550
physical_device_desc: "device: 1, name: Tesla K80, pci bus id: 0000:00:05.0, compute capability: 3.7"
]
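For anyone hitting the single-GPU memory crash above: in the matterport repo the effective batch size is computed in `mrcnn/config.py` as `GPU_COUNT * IMAGES_PER_GPU`, so lowering `IMAGES_PER_GPU` is the usual first step on an ~11 GB card like the K80. A minimal sketch of that relationship (a pure-Python stand-in, not the real `mrcnn.config.Config` class; only the attribute names are taken from the repo):

```python
# Stand-in for mrcnn.config.Config showing how the effective batch size is
# derived. Attribute names (GPU_COUNT, IMAGES_PER_GPU, BATCH_SIZE) follow the
# matterport repo; the class itself is illustrative.

class Config:
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2

    def __init__(self):
        # mrcnn/config.py computes the effective batch size the same way
        self.BATCH_SIZE = self.GPU_COUNT * self.IMAGES_PER_GPU


class LowMemoryConfig(Config):
    # One 1024x1024 image per K80 keeps each ~11 GB card within budget
    GPU_COUNT = 2
    IMAGES_PER_GPU = 1


print(LowMemoryConfig().BATCH_SIZE)  # effective batch across both GPUs
```

With two GPUs and one image per GPU, the effective batch stays at 2 while the per-GPU memory footprint is halved.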
Hi,
In order to use multiple GPUs I needed to downgrade Keras 2.2.2 to Keras 2.1.3.
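If anyone needs the exact command, a pinned downgrade inside the conda env would look like this (standard pip version-specifier syntax; run in the activated `mask-rcnn_env`):

```shell
# Replace Keras 2.2.2 with 2.1.3 in the active environment
pip install 'keras==2.1.3'
```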
I didn't need to downgrade Keras. Simply following the suggestion in the error message was sufficient, i.e., add the line `super(ParallelModel, self).__init__()` to `parallel_model.py` directly after the initial comment in `def __init__(self, keras_model, gpu_count):` (line 30).
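To see why that one line fixes it, here is a minimal pure-Python reproduction of the mechanism (no Keras needed; class names are illustrative stand-ins): Keras 2.2.x's `Network.__setattr__` reads `self._is_graph_network`, which only exists after `Network.__init__` has run, so assigning any attribute first raises the `RuntimeError` from the traceback.

```python
# Minimal reproduction of the ParallelModel failure and its fix.
# `Network` stands in for keras.engine.network.Network; only the
# _is_graph_network guard is modeled, not real Keras behaviour.

class Network:
    def __init__(self):
        # Keras bypasses its own __setattr__ guard internally, like this
        object.__setattr__(self, "_is_graph_network", False)

    def __setattr__(self, name, value):
        try:
            self._is_graph_network  # missing until __init__ has run
        except AttributeError:
            raise RuntimeError(
                "It looks like you are subclassing `Model` and you "
                "forgot to call `super(YourClass, self).__init__()`.")
        object.__setattr__(self, name, value)


class BrokenParallelModel(Network):
    def __init__(self, keras_model, gpu_count):
        # Assigning before super().__init__() triggers the RuntimeError
        self.inner_model = keras_model


class FixedParallelModel(Network):
    def __init__(self, keras_model, gpu_count):
        super(FixedParallelModel, self).__init__()  # the one-line fix
        self.inner_model = keras_model
        self.gpu_count = gpu_count


try:
    BrokenParallelModel("model", 2)
except RuntimeError as e:
    print("broken:", e)

m = FixedParallelModel("model", 2)
print("fixed:", m.inner_model, m.gpu_count)
```

Keras 2.1.x did not have this guard in `__setattr__`, which is why downgrading also works around the error.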
@simone-codeluppi, @florian-koenig, can you please share the inference speed (in seconds or milliseconds per picture) on your Google Cloud machine?
In my case, the inference speed (n1-standard-4 Google Cloud VM with a Tesla P4 GPU) is ~5.4 s per 1024 x 1024 picture on a tensorflow-gpu 1.12 setup (see details here: https://github.com/matterport/Mask_RCNN/issues/1270).
I also profiled a V100 GPU with 16 GB memory; the inference speed is a little better at ~3 s per 1024 x 1024 picture, but still far behind the 200-300 ms benchmark mentioned in this repository. For reference, the V100 machine has the following setup: tensorflow-gpu==1.8.0, keras==2.1.5, nvcc V9.0.176, cudnn 7.4.2, python 3.6.8.