Here is a snippet from input_preprocess.py in the DeepLab model (starting at line 122):
# Randomly crop the image and label.
if is_training and label is not None:
  processed_image, label = preprocess_utils.random_crop(
      [processed_image, label], crop_height, crop_width)

processed_image.set_shape([crop_height, crop_width, 3])

if label is not None:
  label.set_shape([crop_height, crop_width, 1])

if is_training:
  # Randomly left-right flip the image and label.
  processed_image, label, _ = preprocess_utils.flip_dim(
      [processed_image, label], _PROB_OF_FLIP, dim=1)
Obviously, the crop is only performed in training mode, but the processed_image shape is set to [crop_height, crop_width, 3] unconditionally.
This causes a problem when we evaluate the xception65 model, which produces the following error:
InvalidArgumentError (see above for traceback): padded_shape[0]=45 is not divisible by block_shape[0]=2
[[Node: xception_65/exit_flow/block2/unit_1/xception_module/separable_conv1_depthwise/depthwise/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](xception_65/exit_flow/block1/unit_1/xception_module/add, xception_65/exit_flow/block2/unit_1/xception_module/separable_conv1_depthwise/depthwise/SpaceToBatchND/block_shape, xception_65/exit_flow/block2/unit_1/xception_module/separable_conv1_depthwise/depthwise/SpaceToBatchND/paddings)]]
If we force the input image size to be [513, 513], it works. This test was done with the Pascal VOC data set.
During eval, we always do whole-image inference, meaning you need to set eval_crop_size >= largest image dimension.
We always set crop_size = output_stride * k + 1, where k is an integer. When working on PASCAL images, the largest dimension is 512. Thus, we set crop_size = 513 = 16 * 32 + 1 > 512. Similarly, we set eval_crop_size = 1025x2049 for Cityscapes images.
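Spelled out as a quick sketch (the eval_crop_size helper below is purely illustrative, not part of the DeepLab code base):

import math

def eval_crop_size(largest_dim, output_stride=16):
  # Smallest value of output_stride * k + 1 (k an integer) that covers largest_dim.
  k = math.ceil(largest_dim / output_stride)
  return output_stride * k + 1

print(eval_crop_size(512))   # 513 for PASCAL VOC (16 * 32 + 1)
print(eval_crop_size(1024))  # 1025 for Cityscapes height (16 * 64 + 1)
print(eval_crop_size(2048))  # 2049 for Cityscapes width (16 * 128 + 1)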
Thanks @aquariusjay. We follow exactly your parameters on Pascal VOC; the crop size is set to 513. We noticed that in the pre-processing code the image is randomly scaled even in eval mode. Is that correct? After this data augmentation the image size can be, for example, [670, 1000, 3], and this causes the error in the xception65 forward pass.
If you need multi-scale inputs during inference, please call this function
https://github.com/tensorflow/models/blob/master/research/deeplab/model.py#L91
And this should have already been handled in eval.py
https://github.com/tensorflow/models/blob/master/research/deeplab/eval.py#L112
Do not use the pre-processing for multi-scale inputs during inference.
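As a rough sketch of that path (hedged: the names below, model.predict_labels, model.predict_labels_multi_scale, common.ModelOptions and common.OUTPUT_TYPE, are my reading of the linked files and may differ slightly between revisions):

import tensorflow as tf
from deeplab import common
from deeplab import model

# Whole-image inference on a fixed eval crop, as in eval.py.
images = tf.placeholder(tf.float32, shape=[1, 513, 513, 3])
model_options = common.ModelOptions(
    outputs_to_num_classes={common.OUTPUT_TYPE: 21},
    crop_size=[513, 513],
    atrous_rates=[6, 12, 18],
    output_stride=16)

eval_scales = (1.0,)  # single-scale test; e.g. (0.5, 0.75, 1.0, 1.25) for multi-scale
if tuple(eval_scales) == (1.0,):
  predictions = model.predict_labels(images, model_options)
else:
  # Multi-scale inputs are handled inside the model,
  # not in the pre-processing pipeline.
  predictions = model.predict_labels_multi_scale(
      images, model_options, eval_scales=eval_scales)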
This problem appears when we perform a single-scale test. We do not explicitly call the pre-processing function during evaluation; it is called from the get() function of input_generator. And if you take a look at that function, the data augmentation is always performed, even in eval mode. We believe this can be problematic: during evaluation the input image can end up with a shape that is not compatible with the xception65 network.
During inference, there is no need to do any data augmentation. You could simply set min_scale_factor = max_scale_factor = 1, which is what we do in the provided examples.
Also, if you really think it is a problem, you could add an if is_training check before those preprocessing functions.
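Guarding the random-scale step would look roughly like this (a sketch: the helper names get_random_scale and randomly_scale_image_and_label are the ones in preprocess_utils, while the surrounding lines are paraphrased from input_preprocess.py rather than copied):

# Data augmentation by randomly scaling the inputs -- training mode only.
if is_training:
  scale = preprocess_utils.get_random_scale(
      min_scale_factor, max_scale_factor, scale_factor_step_size)
  processed_image, label = preprocess_utils.randomly_scale_image_and_label(
      processed_image, label, scale)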
Adding if is_training is what we finally did, and we get roughly the same results as yours. Thanks for answering.
How do we set k? @aquariusjay
k is basically the spatial size of the feature map produced by the feature extractor network. For example, with output_stride = 16 and an input image size of 512, we get k = 512 / 16 = 32.
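A quick check of the arithmetic (assuming 'SAME' padding throughout the backbone, so a strided stack produces ceil(input / stride) spatial positions):

import math

output_stride = 16
crop_size = 513                                       # output_stride * k + 1 with k = 32

k = (crop_size - 1) // output_stride                  # 32
feature_size = math.ceil(crop_size / output_stride)   # 33, i.e. k + 1 positions
print(k, feature_size)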