https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py
When I run model_main_tf2.py using the steps mentioned here, I am getting the following error:
2020-08-31 13:41:33.423364: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
Traceback (most recent call last):
File "/mnt/data/tf_api/models/research/object_detection/model_main_tf2.py", line 113, in <module>
tf.compat.v1.app.run()
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/mnt/data/tf_api/models/research/object_detection/model_main_tf2.py", line 88, in main
wait_interval=300, timeout=FLAGS.eval_timeout)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 982, in eval_continuously
global_step=global_step)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 791, in eager_eval_loop
eval_dict, losses_dict, class_agnostic = compute_eval_dict(features, labels)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/mnt/data/tf_api/models/research/object_detection/.venv/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Shapes of all inputs must match: values[0].shape = [2] != values[1].shape = [3]
[[node stack_361 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:155) ]]
[[SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression_24/unstack/_4300]]
(1) Invalid argument: Shapes of all inputs must match: values[0].shape = [2] != values[1].shape = [3]
[[node stack_361 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:155) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_282075]
Errors may have originated from an input operation.
Input Source operations connected to node stack_361:
Slice_314 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:273)
Input Source operations connected to node stack_361:
Slice_314 (defined at /.venv/lib/python3.7/site-packages/object_detection/model_lib.py:273)
Function call stack:
compute_eval_dict -> compute_eval_dict
I am evaluating a Mask R-CNN model. The config I used can be found in the pre-trained model folder here.
The script that I run is:
# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH="XXXX/models/research/object_detection/configs/tf2/uhuru.config"
MODEL_DIR="XXXX"
CHECKPOINT_DIR=${MODEL_DIR}
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
nohup python XXXX/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --alsologtostderr &
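(Side note: the SAMPLE_1_OF_N_EVAL_EXAMPLES variable above is defined but never actually passed to the script. If sampling were needed, it could presumably be forwarded via the --sample_1_of_n_eval_examples flag that model_main_tf2.py exposes; this is a hypothetical variant, not the command I ran:)

# Hypothetical variant that also forwards the sampling flag:
nohup python XXXX/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --sample_1_of_n_eval_examples=${SAMPLE_1_OF_N_EVAL_EXAMPLES} \
    --alsologtostderr &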
I have checked the records that I am using in the eval process. They are in the standard format, containing the image, bounding-box coordinates, and labels.
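For reference, this is roughly how the records can be sanity-checked (a minimal sketch; the record path is a placeholder and the feature keys assume the standard TF Object Detection API TFRecord format):

import tensorflow as tf

# Decode a few tf.train.Example records from the eval TFRecord and print the
# encoded image size and the per-object box/label counts.
for raw in tf.data.TFRecordDataset("XXXXX/eval.record").take(5):
    example = tf.train.Example.FromString(raw.numpy())
    feats = example.features.feature
    print(
        "image bytes:", len(feats["image/encoded"].bytes_list.value[0]),
        "boxes:", len(feats["image/object/bbox/xmin"].float_list.value),
        "labels:", len(feats["image/object/class/label"].int64_list.value),
    )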
The eval config that I am using is:
eval_config: {
  metrics_set: "coco_detection_metrics"
  #metrics_set: "coco_mask_metrics"
  eval_instance_masks: false
  use_moving_averages: false
  batch_size: 50
  include_metrics_per_category: true
}
eval_input_reader: {
  label_map_path: "XXXXX/models/research/object_detection/data/uhuru.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "XXXXX"
  }
  load_instance_masks: false
}
I am using this model for detection purposes.
I expect this script to evaluate our model on the eval data and write the relevant metrics and visualizations to TensorBoard.
Training works perfectly; the only problem is evaluation. Also, I wonder why running eval after training has been removed in the TF2 version.
I think your batch size is too big in your config. Try using a batch size of 1.
I've got the same issue with my own project. It worked fine when I tested on the Oxford Pet Dataset.
This seems to be purely memory-dependent.
As suggested above, change the batch size to a smaller value (preferably a batch size of 1) and then rerun the evaluation.
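In the eval_config posted above, that would mean changing just the batch_size line, e.g. (only that line differs from the original config):

eval_config: {
  metrics_set: "coco_detection_metrics"
  #metrics_set: "coco_mask_metrics"
  eval_instance_masks: false
  use_moving_averages: false
  batch_size: 1
  include_metrics_per_category: true
}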
@DevanshBheda
Can you please reduce the batch size and let us know if the issue still persists? Thanks!
My evaluation batch size is 1. It was working when I only had TensorFlow 2.2 (Anaconda virtual environment). I set up another virtual environment for the TensorFlow nightly version, so I currently have two separate virtual envs. I tested with both versions; neither of them works.
@ravikyram Reducing the batch size worked for me, thanks for recommending that. This solves the eval script issue. Can you suggest a way to run this eval script after every training step, as in the old version? Should I open a new feature request for this?
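(From what I can tell from the TF2 Object Detection docs, the intended workflow is now to run evaluation as a second process that watches the checkpoint directory while training runs, rather than interleaving eval with the training loop. A rough sketch, reusing the variables from my script above:)

# Training job (no --checkpoint_dir):
python XXXX/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --alsologtostderr &
# Continuous-eval job: same command plus --checkpoint_dir, evaluates each new checkpoint.
python XXXX/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${MODEL_DIR} \
    --alsologtostderr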
I was having a similar issue, though with no explicit mention of memory problems. I reduced the eval batch size from 8 to 1 et voilà! It worked. Thank you all.