Mask_rcnn: Multi GPU error: InvalidArgumentError : ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0

Created on 16 May 2018 · 10 Comments · Source: matterport/Mask_RCNN

On Ubuntu 14.04 with Python 3.6.0, TensorFlow 1.4.0 and Keras 2.0.8, training on multiple GPUs fails with this error:

InvalidArgumentError : ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0


A more detailed log follows:

mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)
/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/training.py:1987: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'
Epoch 1/30
2018-05-16 09:22:38.466424: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.466601: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.466769: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.466866: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.477960: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.554091: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.594089: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.596734: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.596861: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.596941: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
Traceback (most recent call last):
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
     [[Node: training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad/_5189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_16970_training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ljy/Mask_RCNN/samples/balloon/balloon_nm.py", line 381, in <module>
    train(model)
  File "/home/ljy/Mask_RCNN/samples/balloon/balloon_nm.py", line 214, in train
    layers='heads')
  File "/home/ljy/Mask_RCNN/mrcnn/model_detection.py", line 2329, in train
    use_multiprocessing=True,
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2273, in __call__
    **self.session_kwargs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
     [[Node: training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad/_5189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_16970_training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'mrcnn_bbox_loss_1/concat', defined at:
  File "/home/ljy/Mask_RCNN/samples/balloon/balloon_nm.py", line 348, in <module>
    model_dir=args.logs)
  File "/home/ljy/Mask_RCNN/mrcnn/model_detection.py", line 1824, in __init__
    self.keras_model = self.build(mode=mode, config=config)
  File "/home/ljy/Mask_RCNN/mrcnn/model_detection.py", line 2043, in build
    model = ParallelModel(model, config.GPU_COUNT)
  File "/home/ljy/Mask_RCNN/mrcnn/parallel_model.py", line 37, in __init__
    merged_outputs = self.make_parallel()
  File "/home/ljy/Mask_RCNN/mrcnn/parallel_model.py", line 102, in make_parallel
    m = KL.Concatenate(axis=0, name=name)(outputs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/topology.py", line 602, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/layers/merge.py", line 332, in call
    return K.concatenate(inputs, axis=self.axis)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1709, in concatenate
    return tf.concat([to_dense(x) for x in tensors], axis)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1099, in concat
    return gen_array_ops._concat_v2(values=values, axis=axis, name=name)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 706, in _concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
     [[Node: training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad/_5189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_16970_training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Most helpful comment

I had the same problem. I fixed it by replacing line 97 of parallel_model.py

if K.int_shape(outputs[0]) == ():

by

if K.int_shape(outputs[0]) == () or not K.int_shape(outputs[0]):

I use python=3.6.8, tensorflow-gpu=1.3.0, and keras=2.0.8.

I could not upgrade TensorFlow because I am on a cluster with Nvidia driver version 375.26, which is not compatible with tensorflow>1.4, and I do not have root access to change the driver.

I hope this is useful.
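
To see why the extra `not ...` clause can matter, here is a minimal sketch in plain Python. It assumes that `K.int_shape` may return `None` instead of `()` when the static shape of a tensor is unknown, which appears to be the case in some older Keras versions. The two helper functions are hypothetical names used only for illustration; they are not part of parallel_model.py.

```python
def is_scalar_original(shape):
    # Original check: only an explicit empty tuple counts as "no dimensions".
    return shape == ()


def is_scalar_patched(shape):
    # Patched check: an empty tuple or None (unknown shape) both count.
    return shape == () or not shape


# Possible return values of K.int_shape(outputs[0]) (assumed for illustration).
for shape in [(), None, (None, 4), (2,)]:
    print(shape, is_scalar_original(shape), is_scalar_patched(shape))
```

Under that assumption, a loss tensor whose shape is reported as `None` fails the original check and falls through to the `KL.Concatenate(axis=0, ...)` call shown in the traceback, while the patched check takes the `if` branch instead.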

All 10 comments

I have the same problem with 4 GPUs and have not found a solution.

I don't have this problem. Can you tell me specifically how you set up multi-GPU training?

@liangbo-1 I have a machine with 4 GPUs. I just changed the config parameter to GPU_COUNT = 4
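
For readers unfamiliar with the setup, multi-GPU training here is driven by the `GPU_COUNT` attribute of the config object (the traceback above passes `config.GPU_COUNT` to `ParallelModel`). The sketch below uses a simplified stand-in `Config` base class so it runs without the library; in a real project you would subclass `mrcnn.config.Config` instead. The class name `FourGPUConfig`, the `IMAGES_PER_GPU` value, and the batch-size formula in the stand-in are assumptions for illustration.

```python
class Config:
    # Stand-in for mrcnn.config.Config (assumption: heavily simplified).
    NAME = None
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2

    def __init__(self):
        # The effective batch is split evenly across the GPUs.
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT


class FourGPUConfig(Config):
    # Hypothetical config for a machine with 4 GPUs.
    NAME = "balloon"
    GPU_COUNT = 4
    IMAGES_PER_GPU = 2


config = FourGPUConfig()
print(config.GPU_COUNT, config.BATCH_SIZE)
```

Any `GPU_COUNT` greater than 1 is what routes the model through `ParallelModel`, which is where the failing concatenation is defined.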

I am having the same problem on Ubuntu 16.04, Python 3.6, TensorFlow 1.4 and Keras 2.0.8.

I have the same problem on Ubuntu 16.04, Python 3.5, TF 1.4 and keras 2.1.2. Exactly the same error.

@MichaelLiang12 @taijizhao @Nicolai-Haeni Has anyone solved it? Maybe it is because of the TF version?

@MichaelLiang12 I upgraded to TF 1.8 and Keras 2.1.6 and the problem disappeared. At least, training the shapes sample on multiple GPUs works fine.

I also have this problem, with tf=1.3, keras=2.0.8, cuda=8.0, python=3.4.0.

How can I solve it?

@zhjpqq I upgraded TF and Keras to the newest versions and everything works fine now.
