Mask_rcnn: Training error - Failed to run optimizer, stage RemoveStackStridedSliceSameAxis

Created on 31 Jan 2019 · 13 Comments · Source: matterport/Mask_RCNN

Hi, I'm trying to train the model on the COCO 2017 dataset, but it reports the errors below. Does anyone have the same problem? How can I fix it? Thanks!
I'm using Ubuntu 16.04 64-bit, Python 3.6.7, pip 18.1, tensorflow_gpu 1.13.0-rc0, keras 2.2.4, cuda 10.0.130, libcudnn7-dev_7.4.2.24-1, libcudnn7_7.4.2.24-1.

2019-01-31 10:11:33.060384: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]
2019-01-31 10:11:33.060528: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice_37. Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]
2019-01-31 10:11:43.572615: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]
2019-01-31 10:11:43.572742: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice_37. Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]
999/1000 [============================>.] - ETA: 0s - loss: 0.3451 - rpn_class_loss: 0.0034 - rpn_bbox_loss: 0.0729 - mrcnn_class_loss: 0.0632 - mrcnn_bbox_loss: 0.0454 - mrcnn_mask_loss: 0.1601
2019-01-31 10:24:40.241827: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]
2019-01-31 10:24:40.241952: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice_37. Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]
2019-01-31 10:24:41.861193: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]
1000/1000 [==============================] - 880s 880ms/step - loss: 0.3449 - rpn_class_loss: 0.0034 - rpn_bbox_loss: 0.0729 - mrcnn_class_loss: 0.0631 - mrcnn_bbox_loss: 0.0454 - mrcnn_mask_loss: 0.1600 - val_loss: 2.0419 - val_rpn_class_loss: 0.1070 - val_rpn_bbox_loss: 0.9081 - val_mrcnn_class_loss: 0.4638 - val_mrcnn_bbox_loss: 0.2041 - val_mrcnn_mask_loss: 0.3590
Epoch 6/12
1000/1000 [==============================] - 799s 799ms/step - loss: 0.3522 - rpn_class_loss: 0.0037 - rpn_bbox_loss: 0.0779 - mrcnn_class_loss: 0.0551 - mrcnn_bbox_loss: 0.0508 - mrcnn_mask_loss: 0.1647 - val_loss: 1.5071 - val_rpn_class_loss: 0.0384 - val_rpn_bbox_loss: 0.8576 - val_mrcnn_class_loss: 0.1833 - val_mrcnn_bbox_loss: 0.1751 - val_mrcnn_mask_loss: 0.2529

hopstone
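
The messages in the log above follow TensorFlow's C++ logging layout, where the single letter after the timestamp is the severity. The `W` means they are logged as warnings rather than errors. Here is a minimal sketch for pulling the severity and source location out of such a line. The regular expression and the helper name `parse_tf_log_line` are my own illustration, not part of TensorFlow or Mask_RCNN:

```python
import re

# Layout assumed from the lines quoted in this thread:
# "<date> <time>: <severity letter> <file>:<line>] <message>"
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) "
    r"(?P<source>\S+):(?P<line>\d+)\] "
    r"(?P<message>.*)"
)

SEVERITY_NAMES = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def parse_tf_log_line(text):
    """Return a dict for the first TensorFlow log record in text, or None."""
    match = LOG_LINE.search(text)
    if match is None:
        return None
    record = match.groupdict()
    record["severity"] = SEVERITY_NAMES[record["severity"]]
    return record


sample = (
    "2019-01-31 10:11:33.060384: W ./tensorflow/core/grappler/optimizers/"
    "graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, "
    "stage RemoveStackStridedSliceSameAxis node proposal_targets/strided_slice. "
    "Error: ValidateStridedSliceOp returned partial shapes [1,?,?] and [?,?]"
)
record = parse_tf_log_line(sample)
print(record["severity"], record["source"], record["line"])
```

Because the pattern is applied with `search`, it also picks up a record whose timestamp is glued to the end of a Keras progress-bar line.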

All 13 comments

Hello, I've run into the same problem. Could you tell me the answer?

wangjiaod on 1 Mar 2019

Hello, I've run into the same problem. Could you tell me the answer?

Hi, I actually haven't found a solution, but the model can still be trained even when these errors occur. Hoping for other answers.

hopstone on 1 Mar 2019

👍10

Thanks for your answer! I am in the same situation now: the model can still be trained even when these errors occur. I hope we can solve the problem together!

wangjiaod on 1 Mar 2019

Hello, I'm having the same issue, and at the end it doesn't create the mask_rcnn_bottle_{epoch:04d}.h5 file. After that nothing works properly. Can anybody help?
Thanks in advance!

mihiri91 on 2 Apr 2019

I encountered the same behaviour (this warning appeared, but training still ran). In my case, the problem was a version mismatch between tensorflow and tensorflow-gpu. Both were originally on version 1.12, but installing tensorboard through pip pulled in version 1.13 of tensorboard and tensorflow, while tensorflow-gpu stayed on 1.12.

Removing all versions and then installing 1.12.0 of all packages solved it for me.

anieuwland on 2 Apr 2019
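
A mismatch like the one described above can be spotted by listing the installed versions side by side. Here is a rough sketch, assuming Python 3.8+ for `importlib.metadata` (on the Python 3.6 setup from the original post, `pip list` or `pkg_resources` would be needed instead). The helper names and the choice to compare only major.minor are my own assumptions:

```python
from importlib import metadata  # standard library in Python 3.8+

# Distributions named in this thread that should track the same release.
PACKAGES = ("tensorflow", "tensorflow-gpu", "tensorboard")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def minor_release(version):
    """Reduce '1.13.0-rc0' or '1.12.0' to a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def has_mismatch(versions):
    """True when the installed packages span more than one major.minor release."""
    releases = {minor_release(v) for v in versions.values() if v is not None}
    return len(releases) > 1


if __name__ == "__main__":
    found = installed_versions()
    print(found)
    print("mismatch:", has_mismatch(found))
```

If a mismatch shows up, the fix reported above was to uninstall all of the packages and reinstall them at one version.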

I solved my issue, and it now generates the .h5 files. It was a problem with GPU capacity (I think). Currently, I'm using the Google Colab GPU runtime.

mihiri91 on 2 Apr 2019

@mihiri91 I wonder how you solved your issue. What do you mean by "a problem with GPU capacity"? Not enough memory? I am using Nvidia Docker and getting the same issue.

rabbitwayne on 18 May 2019

I'm having the same problem. I'm using tensorflow==1.13 and cuda==10.
I didn't have this issue when I was using cuda 9.0. Is that the cause?

thomasyue on 23 May 2019

Hi all.
I get the same issue if I make the training batch too big for the GPU to fit in memory. When I reduce the batch size, the error goes away!
Right now I'm training the net to see whether it creates the .h5 files.

MikhailSam on 8 Jun 2019

👍2

For me, reducing the batch size worked too. In particular, reducing IMAGES_PER_GPU from 5 to 1 while keeping the number of GPUs at 8 was sufficient. With 2 images per GPU I still got the error, even though the GPUs have 32 GB of VRAM each.

plbecker on 4 Nov 2019
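
To make the numbers above concrete: as far as I know, the effective batch size in Mask_RCNN is the GPU count multiplied by IMAGES_PER_GPU. The class below is a hypothetical stand-in for illustration only. In the real project these attributes would be set on a subclass of `mrcnn.config.Config`:

```python
class TrainingConfigSketch:
    """Hypothetical stand-in mirroring two Mask_RCNN config attributes."""

    GPU_COUNT = 8
    IMAGES_PER_GPU = 1  # the value reported to work above

    @property
    def batch_size(self):
        # Images processed per training step across all GPUs.
        return self.GPU_COUNT * self.IMAGES_PER_GPU


class FiveImagesPerGpu(TrainingConfigSketch):
    """The original setting that triggered the error for the poster."""

    IMAGES_PER_GPU = 5


reduced = TrainingConfigSketch()
original = FiveImagesPerGpu()
print("batch size with 1 image per GPU:", reduced.batch_size)
print("batch size with 5 images per GPU:", original.batch_size)
```

So going from 5 images per GPU to 1 shrinks each step from 40 images to 8, with the per-GPU memory load dropping accordingly.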

I changed my tensorflow-gpu version from 1.14.0 to 1.13.1 and this issue appeared, so I reinstalled the previous version with pip install tensorflow-gpu==1.14.0, and the issue disappeared.

LucienZhang on 25 Dec 2019

👍6 🎉2 🚀1

I changed my tensorflow-gpu version from 1.14.0 to 1.13.1 and this issue appeared, so I reinstalled the previous version with pip install tensorflow-gpu==1.14.0, and the issue disappeared.

This solved the issue.