Mask_RCNN: multi-GPU error during training

Created on 10 Apr 2018 · 15 comments · Source: matterport/Mask_RCNN

Just cloned the repo, changed only the number of GPUs from 1 to 2, and it produces the following error:

Traceback (most recent call last):
  File "dsb.py", line 279, in <module>
    model_dir=MODEL_DIR)
  File "./mrcnn/model.py", line 1820, in __init__
    self.keras_model = self.build(mode=mode, config=config)
  File "./mrcnn/model.py", line 2039, in build
    model = ParallelModel(model, config.GPU_COUNT)
  File "/home/user/Mask_RCNN/mrcnn/parallel_model.py", line 37, in __init__
    merged_outputs = self.make_parallel()
  File "/home/user/Mask_RCNN/mrcnn/parallel_model.py", line 81, in make_parallel
    outputs = self.inner_model(inputs)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/keras/engine/topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/keras/engine/topology.py", line 2085, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/keras/engine/topology.py", line 2236, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/user/anaconda3/envs/ipy/lib/python3.6/site-packages/keras/layers/core.py", line 663, in call
    return self.function(inputs, **arguments)
  File "./mrcnn/model.py", line 1913, in <lambda>
    anchors = KL.Lambda(lambda x: tf.constant(anchors), name="anchors")(input_image)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 214, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 433, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 341, in _AssertCompatible
    raise TypeError("List of Tensors when single Tensor expected")
TypeError: List of Tensors when single Tensor expected
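
For reference, the change amounts to something like the following sketch (the Config subclass and its NAME are hypothetical; GPU_COUNT and IMAGES_PER_GPU are the repo's own Config attributes):

from mrcnn.config import Config

class TrainConfig(Config):
    NAME = "dsb"        # hypothetical dataset name
    GPU_COUNT = 2       # changing this from 1 to 2 triggers the error above
    IMAGES_PER_GPU = 2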

Most helpful comment

@waleedka OK, fixed this as well. This is a Keras generator error and might also be the solution for #395.

The solution is:
With keras.__version__ == 2.1.5, edit anaconda/lib/python3.6/site-packages/keras/engine/training.py, go to line 2404, and change this

averages.append(np.average([out[i] for out in all_outs], weights=batch_sizes))

to

averages.append(
    np.average(
        np.squeeze(np.asarray([out[i] for out in all_outs])),
        weights=batch_sizes
    )
)

All 15 comments

Fixed it by changing line 1913 in model.py
from:
anchors = KL.Lambda(lambda x: tf.constant(anchors), name="anchors")(input_image)
to:
anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)

However, increasing the number of images seems to halt training, something I haven't figured out yet.
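
For context, here is a minimal sketch of that line inside build(), assuming Keras 2.1.x / TF 1.x (the anchor array shape and input size are hypothetical stand-ins; in the repo, anchors is the numpy array computed earlier in build()):

import numpy as np
import tensorflow as tf
import keras.layers as KL

anchors = np.zeros([261888, 4], dtype=np.float32)  # hypothetical shape
input_image = KL.Input(shape=[1024, 1024, 3], name="input_image")

# Wrapping the numpy array with tf.Variable instead of tf.constant avoids the
# "List of Tensors when single Tensor expected" TypeError that appears when
# ParallelModel invokes the inner model once per GPU (see traceback above).
anchor_layer = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)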

@kirk86 Thanks for the report and solution. I don't have a multi-GPU system at the moment, so I couldn't test that case. Does your fix work for the single GPU case as well?

@waleedka Hi, it works fine on a single GPU, but on multi-GPU I left it running last night and today I got another error:

299/300 [============================>.] - ETA: 6s - loss: 0.2768 - rpn_class_loss: 0.0034 - rpn_bbox_loss: 0.0477 - mrcnn_class_loss: 0.0263 - mrcnn_bbox_loss: 0.0389 - mrcnn_mask_loss: 0.1604 /home/user/anaconda3ipy/lib/python3.6/site-packages/keras/engine/training.py:2330: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'
Traceback (most recent call last):
  File "dsb.py", line 299, in <module>
    layers="all")
  File "./mrcnn/model.py", line 2318, in train
    use_multiprocessing=True,
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/keras/engine/training.py", line 2244, in fit_generator
    max_queue_size=max_queue_size)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/keras/engine/training.py", line 2405, in evaluate_generator
    weights=batch_sizes))
  File "/home/user/anaconda3/ipy/lib/python3.6/site-packages/numpy/lib/function_base.py", line 1142, in average
    "Axis must be specified when shapes of a and weights "
TypeError: Axis must be specified when shapes of a and weights differ.

From my understanding this happens at the validation step, since one iteration of training has already finished. I've also changed the evaluation configuration to 2 GPUs and 2 IMAGES_PER_GPU, same as for training. I believe the error indicates some operation where either no axis is specified or keepdims=True is needed?

@waleedka OK, fixed this as well. This is a Keras generator error and might also be the solution for #395.

The solution is:
With keras.__version__ == 2.1.5, edit anaconda/lib/python3.6/site-packages/keras/engine/training.py, go to line 2404, and change this

averages.append(np.average([out[i] for out in all_outs], weights=batch_sizes))

to

averages.append(
    np.average(
        np.squeeze(np.asarray([out[i] for out in all_outs])),
        weights=batch_sizes
    )
)
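
To see why this works: with ParallelModel, each per-step output can come back as a length-1 array rather than a scalar, so the stacked outputs and the weights end up with different shapes. A minimal numpy sketch of this failure mode (the loss values are made up):

import numpy as np

# Hypothetical per-step outputs: length-1 arrays rather than scalars.
all_outs = [[np.array([0.5])], [np.array([0.4])], [np.array([0.6])]]
batch_sizes = [4, 4, 4]

i = 0
stacked = np.asarray([out[i] for out in all_outs])  # shape (3, 1)
# np.average(stacked, weights=batch_sizes)
# -> TypeError: Axis must be specified when shapes of a and weights differ.

# np.squeeze drops the singleton dimension, giving shape (3,), which
# matches the weights, so the weighted average succeeds:
print(np.average(np.squeeze(stacked), weights=batch_sizes))  # 0.5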

Note that if someone uses a configuration such as:

GPU_COUNT = 2
IMAGES_PER_GPU = x

where x > 2, then the program halts for a very, very long time.

I believe it is also Keras related. Behind the scenes, the TensorFlow allocator is trying to allocate memory but runs out of resources, and this message somehow gets suppressed (I don't know why). If you kill the program with Ctrl-C you immediately see the message from the TensorFlow allocator, but during a normal run you get nothing and just keep waiting, assuming everything works fine.

Thanks @kirk86, I pushed the fix you suggested for tf.constant -> tf.Variable.

@kirk86
Thank you.
In addition, I fixed this:
STEPS_PER_EPOCH = (657 - len(VAL_IMAGE_IDS)) // (IMAGES_PER_GPU * GPU_COUNT)
VALIDATION_STEPS = max(1, len(VAL_IMAGE_IDS) // (IMAGES_PER_GPU * GPU_COUNT))
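
The idea is that each step consumes IMAGES_PER_GPU * GPU_COUNT images, so both step counts are divided by the effective batch size. A quick sketch with hypothetical numbers (the 657 comes from the commenter's dataset; the validation split size is made up):

GPU_COUNT = 2
IMAGES_PER_GPU = 2
VAL_IMAGE_IDS = list(range(57))  # hypothetical: 57 validation images

BATCH_SIZE = IMAGES_PER_GPU * GPU_COUNT                       # effective batch = 4
STEPS_PER_EPOCH = (657 - len(VAL_IMAGE_IDS)) // BATCH_SIZE    # 600 // 4 = 150
VALIDATION_STEPS = max(1, len(VAL_IMAGE_IDS) // BATCH_SIZE)   # 57 // 4 = 14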

I just pushed a fix for parallel_model to get around the error in validation when on multiple GPUs. This fixes the problem without having to patch Keras. Getting multi-GPU training to work on Keras is a bit tricky, so I hope this fix doesn't break something else. I did a lot of testing and so far it looks good.

@waleedka I pulled that commit, but unfortunately I still run into this error (Python 3.6, tf 1.8):

Traceback (most recent call last):
  File "samples/coco/pascal.py", line 425, in <module>
    augmentation=augmentation)
  File "/home/users/wenting.zhao/anaconda3/envs/py3/lib/python3.6/site-packages/mask_rcnn-2.1-py3.6.egg/mrcnn/model.py", line 2325, in train
  File "/home/users/wenting.zhao/anaconda3/envs/py3/lib/python3.6/site-packages/Keras-2.1.5-py3.6.egg/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/users/wenting.zhao/anaconda3/envs/py3/lib/python3.6/site-packages/Keras-2.1.5-py3.6.egg/keras/engine/training.py", line 2244, in fit_generator
    max_queue_size=max_queue_size)
  File "/home/users/wenting.zhao/anaconda3/envs/py3/lib/python3.6/site-packages/Keras-2.1.5-py3.6.egg/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/users/wenting.zhao/anaconda3/envs/py3/lib/python3.6/site-packages/Keras-2.1.5-py3.6.egg/keras/engine/training.py", line 2405, in evaluate_generator
    weights=batch_sizes))
  File "/home/users/wenting.zhao/anaconda3/envs/py3/lib/python3.6/site-packages/numpy/lib/function_base.py", line 1142, in average
    "Axis must be specified when shapes of a and weights "
TypeError: Axis must be specified when shapes of a and weights differ.

Is there any additional info I can provide to help you reproduce this error? Thanks! (will look into more myself as well)

Edited: the fix @kirk86 provided (modifying Keras) does work.

@kirk86 Have you solved the "can't reach last step" problem? I've run into the same one and would really appreciate your help.

I modified the Keras file, but it's still the same.

It gets stuck here and the progress bar no longer advances.

Here is the Log:

/home/*****/miniconda3/lib/python3.6/site-packages/keras/engine/training.py:2023: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'
Epoch 1/1
 99/100 [============================>.] - ETA: 0s - loss: 1.3819 - rpn_class_loss: 0.0063 - rpn_bbox_loss: 0.3933 - mrcnn_class_loss: 0.0268 - mrcnn_bbox_loss: 0.4655 - mrcnn_mask_loss: 0.4899/home/zhangwenjie/miniconda3/lib/python3.6/site-packages/keras/engine/training.py:2251: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'

Same issue, have you solved that problem?

Same here. I already modified that Keras file but still got the same problem.

The 'divide by zero' error disappeared for me after updating the TF version to the latest.
python=2.7.6
keras=2.1.2
tf=1.11.0

I get NaN losses whenever I train on multiple GPUs, but it works fine when training on a single GPU... has anyone faced this issue?
