Models: Error while running model_main_tf2.py in eval mode.

Created on 31 Aug 2020 · 7 Comments · Source: tensorflow/models

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[Y] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[Y] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[Y] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

When I run model_main_tf2.py using the steps mentioned here, I am getting the following error:

2020-08-31 13:41:33.423364: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
Traceback (most recent call last):
  File "/mnt/data/tf_api/models/research/object_detection/model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/mnt/data/tf_api/models/research/object_detection/model_main_tf2.py", line 88, in main
    wait_interval=300, timeout=FLAGS.eval_timeout)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 982, in eval_continuously
    global_step=global_step)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 791, in eager_eval_loop
    eval_dict, losses_dict, class_agnostic = compute_eval_dict(features, labels)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Shapes of all inputs must match: values[0].shape = [2] != values[1].shape = [3]
         [[node stack_361 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:155) ]]
         [[SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression_24/unstack/_4300]]
  (1) Invalid argument:  Shapes of all inputs must match: values[0].shape = [2] != values[1].shape = [3]
         [[node stack_361 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:155) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_282075]

Errors may have originated from an input operation.
Input Source operations connected to node stack_361:
 Slice_314 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:273)

Input Source operations connected to node stack_361:
 Slice_314 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:273)

Function call stack:
compute_eval_dict -> compute_eval_dict

I am running evaluation on a Mask R-CNN model. The config I used can be found in the pre-trained model folder here.
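For readers unfamiliar with the message in the traceback: "Shapes of all inputs must match" is what a stack operation reports when the per-example tensors it is asked to combine do not all have the same shape. The following is a minimal pure-Python sketch of that check, not the TensorFlow implementation; the function name `stack_1d` and the sample values are hypothetical. It also shows why a batch of one element can never hit this check, which is consistent with the batch-size workaround discussed in the comments, although the thread does not confirm the root cause.

```python
from typing import List, Sequence


def stack_1d(values: Sequence[Sequence[float]]) -> List[List[float]]:
    """Stack equal-length 1-D sequences into a 2-D list (illustrative only)."""
    first_len = len(values[0])
    for index, value in enumerate(values[1:], start=1):
        if len(value) != first_len:
            # Mirror the wording of the error reported in the issue.
            raise ValueError(
                "Shapes of all inputs must match: "
                f"values[0].shape = [{first_len}] != "
                f"values[{index}].shape = [{len(value)}]"
            )
    return [list(value) for value in values]


# A single element always stacks: there is nothing to mismatch against.
print(stack_1d([[640, 480]]))

# Two elements of different lengths trigger the shape complaint.
try:
    stack_1d([[640, 480], [640, 480, 3]])
except ValueError as error:
    print(error)
```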

3. Steps to reproduce

The script that I run is:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH="XXXX/models/research/object_detection/configs/tf2/uhuru.config"
MODEL_DIR="XXXX"
CHECKPOINT_DIR=${MODEL_DIR}
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
nohup python XXXX/models/research/object_detection/model_main_tf2.py \
        --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
        --model_dir=${MODEL_DIR} \
        --checkpoint_dir=${CHECKPOINT_DIR} \
        --alsologtostderr &

I have checked the records that I am using in the eval process. They are in the standard format, containing the image, bounding-box coordinates, and label.

The eval config that I am using is:

eval_config: {
  metrics_set: "coco_detection_metrics"
  #metrics_set: "coco_mask_metrics"
  eval_instance_masks: false
  use_moving_averages: false
  batch_size: 50
  include_metrics_per_category: true
}

eval_input_reader: {
  label_map_path: "XXXXX/models/research/object_detection/data/uhuru.pbtxt"
  shuffle: false
  num_epochs: 1

  tf_record_input_reader {
    input_path: "XXXXX"
  }
  load_instance_masks: false
}

I am using this model for detection purposes.

4. Expected behavior

I expect this code to evaluate our model on the eval data and write the evaluation metrics to TensorBoard.

5. Additional context

Training works perfectly. The only problem is evaluation. Also, I wonder why running eval after training has been removed in the tf2 version.

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 9.12
Mobile device name if the issue happens on a mobile device:
TensorFlow installed from (source or binary): Installed as mentioned here
TensorFlow version (use command below): 2.3.0
Python version: 3.7.6
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 10.1.243
GPU model and memory: 8 x NVIDIA Tesla V100

research bug

DevanshBheda

All 7 comments

I think your batch size is too big in your config. Try using a batch size of 1.

yang07ly on 1 Sep 2020

👍1

I've got the same issue using my own project. It worked well when I tested on Oxford Pet Dataset.

Parham-khj on 1 Sep 2020

This seems to be purely memory dependent.
As suggested above, reduce the eval batch size (preferably to 1) and then run the evaluation again.

Panaroja on 2 Sep 2020
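The suggested change amounts to editing one line of the `eval_config` block in the pipeline config. For anyone who wants to script it, here is a minimal sketch that rewrites the config text with the standard library only. The helper name `set_eval_batch_size` is hypothetical, and the sketch assumes the `eval_config` block contains no nested braces, as in the config posted in this issue. It does not use the Object Detection API's own config utilities.

```python
import re


def set_eval_batch_size(config_text, batch_size=1):
    """Return config_text with batch_size inside eval_config set to batch_size."""
    # Assumption: the eval_config block has no nested braces.
    block_pattern = re.compile(r"(eval_config\s*:?\s*\{)([^{}]*)(\})")

    def rewrite_block(match):
        body = match.group(2)
        # Replace an existing, uncommented batch_size line inside the block.
        new_body, count = re.subn(
            r"(?m)^(\s*batch_size\s*:\s*)\d+", rf"\g<1>{batch_size}", body
        )
        if count == 0:
            # No batch_size line was present, so append one.
            new_body = body.rstrip() + f"\n  batch_size: {batch_size}\n"
        return match.group(1) + new_body + match.group(3)

    # Only the first eval_config block is rewritten; train_config is untouched.
    return block_pattern.sub(rewrite_block, config_text, count=1)


example = """eval_config: {
  metrics_set: "coco_detection_metrics"
  batch_size: 50
}
"""
print(set_eval_batch_size(example))
```

Editing the file by hand is just as valid; the point is only that the `batch_size` field under `eval_config`, not the training batch size, is the one to change.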

👍1

@DevanshBheda

Can you please reduce the batch size and let us know if the issue still persists. Thanks!

ravikyram on 2 Sep 2020

> @DevanshBheda
>
> Can you please reduce the batch size and let us know if the issue still persists. Thanks!

My evaluation batch size is 1. Evaluation was working when I only had TensorFlow 2.2 installed (in an Anaconda virtual environment). I have since set up a second virtual environment for the TensorFlow nightly build, so I now have two separate environments. I tested with both versions, and neither of them works.

Parham-khj on 2 Sep 2020

@ravikyram Reducing the batch size worked for me. Thanks for recommending that. This solves the eval script issue. Can you suggest a way to run this eval script after every training step, as in the old version? Should I open a new feature request for this?

DevanshBheda on 2 Sep 2020
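On the follow-up question about evaluating during training: the traceback in the issue shows eval mode calling `eval_continuously` with `wait_interval=300`, which suggests that the eval process keeps polling the checkpoint directory for new checkpoints. One possible arrangement, sketched below under that assumption, is to launch a training process and a separate eval process that point at the same directory. The helper `build_command`, its default script path, and the example paths are hypothetical. Only flags that appear in the script posted in the issue are used, and the sketch only builds the command lines without launching anything.

```python
import shlex
from typing import List, Optional


def build_command(
    pipeline_config_path: str,
    model_dir: str,
    checkpoint_dir: Optional[str] = None,
    script_path: str = "object_detection/model_main_tf2.py",
) -> List[str]:
    """Build a model_main_tf2.py command line as a list of arguments.

    Assumption: passing --checkpoint_dir selects eval mode, as in the
    script posted in this issue; omitting it runs training.
    """
    command = [
        "python",
        script_path,
        f"--pipeline_config_path={pipeline_config_path}",
        f"--model_dir={model_dir}",
        "--alsologtostderr",
    ]
    if checkpoint_dir is not None:
        command.append(f"--checkpoint_dir={checkpoint_dir}")
    return command


# Hypothetical paths, for illustration only.
train_cmd = build_command("configs/my.config", "/tmp/my_model")
eval_cmd = build_command(
    "configs/my.config", "/tmp/my_model", checkpoint_dir="/tmp/my_model"
)

# Print shell-ready commands; each would run in its own process.
for cmd in (train_cmd, eval_cmd):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running the two commands in separate terminals (or under `nohup`, as in the original script) keeps evaluation out of the training loop, at the cost of the eval job needing its own memory.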

I was having a similar issue, though with no explicit mention of memory problems. I reduced the eval batch size from 8 to 1 and, voilà, it worked. Thank you all.