Models: [Deeplab] input_preprocess doesn't give the correct shape in evaluation mode

Created on 10 Apr 2018 · 8 comments · Source: tensorflow/models

Describe the problem

Here is a snippet from input_preprocess.py in the DeepLab model (starting at line 122):

  # Randomly crop the image and label.
  if is_training and label is not None:
    processed_image, label = preprocess_utils.random_crop(
        [processed_image, label], crop_height, crop_width)

  processed_image.set_shape([crop_height, crop_width, 3])

  if label is not None:
    label.set_shape([crop_height, crop_width, 1])

  if is_training:
    # Randomly left-right flip the image and label.
    processed_image, label, _ = preprocess_utils.flip_dim(
        [processed_image, label], _PROB_OF_FLIP, dim=1)

Obviously, the crop is only performed in training mode, but the processed_image shape is set to [crop_height, crop_width, 3] unconditionally.

This causes a problem when we evaluate the xception65 model, which produces the following error:

InvalidArgumentError (see above for traceback): padded_shape[0]=45 is not divisible by block_shape[0]=2
     [[Node: xception_65/exit_flow/block2/unit_1/xception_module/separable_conv1_depthwise/depthwise/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](xception_65/exit_flow/block1/unit_1/xception_module/add, xception_65/exit_flow/block2/unit_1/xception_module/separable_conv1_depthwise/depthwise/SpaceToBatchND/block_shape, xception_65/exit_flow/block2/unit_1/xception_module/separable_conv1_depthwise/depthwise/SpaceToBatchND/paddings)]]

If we force the input image size to be [513, 513], it works.

This test was done with the Pascal VOC dataset.


All 8 comments

During eval, we always do whole-image inference, meaning you need to set eval_crop_size >= largest image dimension.

We always set crop_size = output_stride * k + 1, where k is an integer. When working on PASCAL images, the largest dimension is 512. Thus, we set crop_size = 513 = 16 * 32 + 1 > 512. Similarly, we set eval_crop_size = 1025x2049 for Cityscapes images.
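
In other words, given the largest image dimension and the output_stride, the smallest valid eval crop size falls out of the formula directly. A small stand-alone sketch (the helper name is ours, not part of DeepLab):

def smallest_valid_crop_size(largest_image_dim, output_stride=16):
  """Smallest crop size of the form output_stride * k + 1 covering the image."""
  # Integer ceiling of (largest_image_dim - 1) / output_stride.
  k = (largest_image_dim + output_stride - 2) // output_stride
  return output_stride * k + 1

# PASCAL VOC: largest dimension is 512 -> 16 * 32 + 1 = 513.
assert smallest_valid_crop_size(512) == 513
# Cityscapes: 1024 x 2048 -> eval_crop_size = 1025 x 2049.
assert smallest_valid_crop_size(1024) == 1025
assert smallest_valid_crop_size(2048) == 2049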

Thanks @aquariusjay. We follow your parameters exactly on Pascal VOC; the crop size is set to 513. We notice that in the pre-processing code the image is randomly scaled even in eval mode. Is that correct? After this data augmentation, the image size can be, for example, [670, 1000, 3], which causes the error in the xception65 forward pass.

If you need multi-scale inputs during inference, please call this function:
https://github.com/tensorflow/models/blob/master/research/deeplab/model.py#L91
This should already be handled in eval.py:
https://github.com/tensorflow/models/blob/master/research/deeplab/eval.py#L112

Do not use the pre-processing for multi-scale inputs during inference.
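
For reference, here is roughly what that multi-scale path looks like when called directly. The signatures below (model.predict_labels_multi_scale and common.ModelOptions) are recalled from the research/deeplab code of this period and may differ in other revisions, so treat this as a sketch rather than the exact eval.py code:

import tensorflow as tf
from deeplab import common
from deeplab import model

# Whole images from the eval input pipeline, padded to the eval crop size.
images = tf.placeholder(tf.float32, shape=[1, 513, 513, 3])

model_options = common.ModelOptions(
    outputs_to_num_classes={common.OUTPUT_TYPE: 21},  # 21 PASCAL VOC classes
    crop_size=[513, 513],
    atrous_rates=[6, 12, 18],
    output_stride=16)

# Multi-scale (and optionally flipped) inference; eval.py selects this path
# when more than one value is passed via --eval_scales.
predictions = model.predict_labels_multi_scale(
    images,
    model_options=model_options,
    eval_scales=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
    add_flipped_images=True)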

This problem appears when we perform a single-scale test. We do not explicitly call the pre-processing function during evaluation; it is called in the get method of input_generator:

https://github.com/tensorflow/models/blob/2661eb977d454746e26452d5fec7c5edafb76fe3/research/deeplab/utils/input_generator.py#L134

If you take a look at that function, the data augmentation is always performed, even in eval mode:

https://github.com/tensorflow/models/blob/2661eb977d454746e26452d5fec7c5edafb76fe3/research/deeplab/input_preprocess.py#L97

We believe this can be problematic: during evaluation, the input image can end up with a shape that is not compatible with the xception65 network.

During inference, there is no need to do any data augmentation. You could simply set min_scale_factor = max_scale_factor = 1, which is what we do in the provided examples.
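
Concretely, that means calling the preprocessing function with the scale factors pinned to 1 for evaluation. A sketch (the keyword names follow the preprocess_image_and_label signature from this era of the code; double-check them against your checkout):

import tensorflow as tf
from deeplab import input_preprocess

# Single-example tensors as they come out of the dataset reader.
image = tf.placeholder(tf.float32, shape=[None, None, 3])
label = tf.placeholder(tf.int32, shape=[None, None, 1])

# With min_scale_factor = max_scale_factor = 1, no random scaling is applied.
original_image, processed_image, label = input_preprocess.preprocess_image_and_label(
    image,
    label,
    crop_height=513,
    crop_width=513,
    min_scale_factor=1.,
    max_scale_factor=1.,
    scale_factor_step_size=0,
    ignore_label=255,
    is_training=False,
    model_variant='xception_65')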

Also, if you really think it is a problem, you could add if is_training before those preprocessing functions.

Adding if is_training is what we ended up doing, and we get roughly the same results as yours. Thanks for answering.
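
For reference, the guard amounts to wrapping the random-scale augmentation in input_preprocess.py roughly like the sketch below (the helper names are from preprocess_utils in the same directory, as we remember them; exact call sites may differ between versions):

# Inside preprocess_image_and_label(), only apply random scaling when training.
if is_training:
  scale = preprocess_utils.get_random_scale(
      min_scale_factor, max_scale_factor, scale_factor_step_size)
  processed_image, label = preprocess_utils.randomly_scale_image_and_label(
      processed_image, label, scale)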

How do we set k? @aquariusjay

k is basically the spatial size of the feature map produced by the feature extractor network. For example, with output_stride = 16 and an input image size of 512, we get k = 512 / 16 = 32.

