I am trying to use transfer learning to train a model for the Open Images Challenge. I prepared the data as TFRecord files, downloaded faster_rcnn_inception_resnet_v2_atrous_oid from the model zoo, and created a config by modifying the number of classes and the paths. When I run model_main.py to start training, it fails with the following exception:
Traceback (most recent call last):
File "/models/research/object_detection/model_main.py", line 101, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/models/research/object_detection/model_main.py", line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 447, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 531, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 669, in run_local
hooks=train_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -54
[[Node: Pad_9 = Pad[T=DT_FLOAT, Tpaddings=DT_INT32, _device="/device:CPU:0"](cond_2/Merge, stack_9)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[1], [1,?,?,3], [1,3], [1,100], [1,100,4], [1,100,500], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
This is the config I used:
model {
  faster_rcnn {
    num_classes: 500
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: "faster_rcnn_inception_resnet_v2"
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        height_stride: 8
        width_stride: 8
        scales: 0.25
        scales: 0.5
        scales: 1.0
        scales: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 1.0
        aspect_ratios: 2.0
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.00999999977648
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.699999988079
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        use_dropout: false
        dropout_keep_probability: 1.0
      }
    }
    second_stage_batch_size: 20
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.300000011921
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config {
  batch_size: 1
  optimizer {
    momentum_optimizer {
      learning_rate {
        manual_step_learning_rate {
          initial_learning_rate: 5.99999964379e-07
          schedule {
            step: 1000
            learning_rate: 5.99999984843e-05
          }
          schedule {
            step: 60000
            learning_rate: 6.00000021223e-06
          }
          schedule {
            step: 70000
            learning_rate: 6.00000021223e-07
          }
        }
      }
      momentum_optimizer_value: 0.899999976158
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/data/object_detection_models/faster_rcnn_inception_resnet_v2_atrous_oid_2018_01_28/model.ckpt"
  num_steps: 1000
  load_all_detection_checkpoint_vars: true
  fine_tune_checkpoint_type: "detection"
}
train_input_reader {
  label_map_path: "/models/research/object_detection/data/oid_object_detection_challenge_500_label_map.pbtxt"
  num_readers: 1
  tf_record_input_reader {
    input_path: "/data/images/train_tfrecords/tfrecord-00000-of-00001"
  }
}
eval_config {
  num_examples: 1000
  max_evals: 10
  metrics_set: "open_images_metrics"
  use_moving_averages: false
  retain_original_images: true
}
eval_input_reader {
  label_map_path: "/models/research/object_detection/data/oid_object_detection_challenge_500_label_map.pbtxt"
  shuffle: false
  num_readers: 1
  tf_record_input_reader {
    input_path: "/data/images/validation_tfrecords/tfrecord-00000-of-00001"
  }
}
This is the complete output from model_main.py:
/usr/local/lib/python2.7/dist-packages/object_detection/utils/visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.
The backend was *originally* set to 'TkAgg' by the following code:
File "/models/research/object_detection/model_main.py", line 26, in <module>
from object_detection import model_lib
File "/usr/local/lib/python2.7/dist-packages/object_detection/model_lib.py", line 26, in <module>
from object_detection import eval_util
File "/usr/local/lib/python2.7/dist-packages/object_detection/eval_util.py", line 28, in <module>
from object_detection.metrics import coco_evaluation
File "/usr/local/lib/python2.7/dist-packages/object_detection/metrics/coco_evaluation.py", line 20, in <module>
from object_detection.metrics import coco_tools
File "/usr/local/lib/python2.7/dist-packages/object_detection/metrics/coco_tools.py", line 47, in <module>
from pycocotools import coco
File "build/bdist.linux-x86_64/egg/pycocotools/coco.py", line 49, in <module>
import matplotlib.pyplot as plt
File "/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 71, in <module>
from matplotlib.backends import pylab_setup
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/__init__.py", line 16, in <module>
line for line in traceback.format_stack()
import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
WARNING:tensorflow:Estimator's model_fn (<function model_fn at 0x7f09e1527488>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/core/box_predictor.py:407: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py:2037: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [global_step] is not available in checkpoint
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/core/losses.py:317: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See @{tf.nn.softmax_cross_entropy_with_logits_v2}.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/object_detection/core/losses.py:317: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See @{tf.nn.softmax_cross_entropy_with_logits_v2}.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-07-17 12:48:43.762318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-17 12:48:44.053889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:05:00.0
totalMemory: 7.93GiB freeMemory: 7.81GiB
2018-07-17 12:48:44.053946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-17 12:48:44.250098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-17 12:48:44.250181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2018-07-17 12:48:44.250200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2018-07-17 12:48:44.250436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7541 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:05:00.0, compute capability: 6.1)
2018-07-17 12:49:28.560860: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.53GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:28.562139: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.56GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.242122: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.11GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.267979: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.300060: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:38.651405: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.62GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:42.910921: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.11GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:42.935985: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:42.967801: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-07-17 12:49:45.355076: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
File "/models/research/object_detection/model_main.py", line 101, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/models/research/object_detection/model_main.py", line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 447, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 531, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 669, in run_local
hooks=train_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -54
[[Node: Pad_9 = Pad[T=DT_FLOAT, Tpaddings=DT_INT32, _device="/device:CPU:0"](cond_2/Merge, stack_9)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[1], [1,?,?,3], [1,3], [1,100], [1,100,4], [1,100,500], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
I'm also curious about this issue; I'm receiving an analogous error message and, like dnuffer, I'm getting it when using custom TFRecord files. In my case, I've packaged everything into a Docker image, mawah/debug:tfodapiretinanet, built off of tensorflow/tensorflow:1.8.0-gpu and commit e2d463713d031f93b5916d260df82565d336068a of this repository. The command to reproduce is:
python /retinanet/models/research/object_detection/model_main.py \
    --pipeline_config_path=/retinanet/scripts/retinanet.config \
    --model_dir=/retinanet/output_model \
    --num_train_steps=25000 \
    --num_eval_steps=8000 \
    --alsologtostderr
I've noticed that my error message is not exactly the same if I run the command twice -- the negative integer in the first line below changes each time.
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -10
[[Node: Pad_9 = Pad[T=DT_FLOAT, Tpaddings=DT_INT32](cond_2/Merge, stack_9)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[64], [64,640,640,3], [64,3], [64,100], [64,100,4], [64,100,60], [64,100], [64,100], [64,100], [64]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
[[Node: IteratorGetNext/_3911 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_892_IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Thanks very much for your help!
I also see the negative integer change every run.
I tried using both image resizers (keep_aspect_ratio_resizer and fixed_shape_resizer), but that didn't make a difference.
I tried using images that were all resized to 320x240 in the tfrecords, but that also didn't make a difference.
I thought this might be related to data augmentation, so I tried using various combinations of no augmentation, random_crop_pad_image, random_crop_image, and random_horizontal_flip, but every combination also yielded this same InvalidArgumentError.
@dnuffer I also found the values changing between runs and started looking at our image dataset.
Initially I suspected a problem with cropping, normalization of bounding boxes, or (h, w) versus (w, h) image-size conventions.
I filtered out all but the square images; since the majority were 500x500, I dropped the few 300x300 ones.
In the config, I replaced the 300x300 crop dimensions with 500x500 (no crop).
This allowed me to begin training, but eventually the InvalidArgumentError reappears.
Will be on the lookout for the fix!
I tried using 500x500 images together with
image_resizer {
  fixed_shape_resizer {
    height: 500
    width: 500
  }
}
in my config, but unfortunately I still got the error.
I experience the same issue using pre-trained Faster R-CNN models on the Open Images Challenge 2018 dataset.
Most Faster R-CNNs listed in the model zoo lead to almost the same error (other models give me different errors, so I cannot try them).
For example, the pre-trained "faster_rcnn_inception_resnet_v2_atrous_oid" with the following config generates ...InvalidArgumentError: Paddings must be non-negative: ... [[Node: Pad_9 = Pad...]].
model {
  faster_rcnn {
    num_classes: 500
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: "faster_rcnn_inception_resnet_v2"
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.00006
          schedule {
            step: 100
            learning_rate: .000006
          }
          schedule {
            step: 1000
            learning_rate: .0000006
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  unpad_groundtruth_tensors: false
  from_detection_checkpoint: True
  fine_tune_checkpoint: ".../pretrained-model.ckpt"
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/....tfrecord"
  }
  label_map_path: ".../label_map.pbtxt"
}
eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 1000
  max_evals: 5
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/....tfrecord"
  }
  label_map_path: ".../label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
I also hit this error when training on my own custom data. It turned out that some of the bounding boxes around target objects were too small relative to the training images. After I filtered out those small bounding boxes and kept only the larger ones, training ran without such errors.
For anyone reading this who still hasn't been able to train a custom dataset: abandon model_main.py and go back to using train.py. I got this error with model_main.py, switched to train.py, and everything worked like a charm. I even got great results. Good luck!
I'm having the same issue. Could it be something to do with overlapping label boxes of the same class?
It has been 60 hours of training without the InvalidArgumentError after filtering out all the boxes where box width < image width / 20, and the same for height.
I believe @tenoyart's suggestion was at the heart of it.
I want to fine-tune the 'ssd_inception_v2_coco_2018_01_28' model on Open Images Dataset V4, and the same error occurred.
So could you tell me how to filter out all the small boxes, or provide sample code?
@mayorquinmachines @tenoyart
Since the TFRecords for the Open Images data are already generated, I can't regenerate the dataset by pre-removing small boxes because it is too large.
How can I remove those boxes when I read a tf.Example?
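One rough way to do this without regenerating anything from the raw images is to rewrite the TFRecord itself, dropping small boxes as each tf.train.Example is copied across. This is only a sketch: the feature keys below assume the standard Object Detection API layout (normalized image/object/bbox/* lists plus aligned class lists) and the file paths are placeholders, so adjust both to match your records.

import tensorflow as tf

MIN_REL_SIZE = 0.05  # drop boxes narrower or shorter than 5% of the image

def filter_small_boxes(example):
    """Removes too-small boxes from all aligned object-level feature lists."""
    f = example.features.feature
    xmin = f['image/object/bbox/xmin'].float_list.value
    xmax = f['image/object/bbox/xmax'].float_list.value
    ymin = f['image/object/bbox/ymin'].float_list.value
    ymax = f['image/object/bbox/ymax'].float_list.value
    keep = [i for i in range(len(xmin))
            if xmax[i] - xmin[i] >= MIN_REL_SIZE and ymax[i] - ymin[i] >= MIN_REL_SIZE]
    float_keys = ['image/object/bbox/xmin', 'image/object/bbox/xmax',
                  'image/object/bbox/ymin', 'image/object/bbox/ymax']
    for key in float_keys:
        vals = list(f[key].float_list.value)
        del f[key].float_list.value[:]
        f[key].float_list.value.extend(vals[i] for i in keep)
    # Any other aligned per-box lists (e.g. group_of or occluded flags) need the
    # same treatment as the class label/text lists below.
    for key, field in [('image/object/class/label', 'int64_list'),
                       ('image/object/class/text', 'bytes_list')]:
        vals = list(getattr(f[key], field).value)
        del getattr(f[key], field).value[:]
        getattr(f[key], field).value.extend(vals[i] for i in keep)
    return example

writer = tf.python_io.TFRecordWriter('/data/filtered.tfrecord')
for record in tf.python_io.tf_record_iterator('/data/original.tfrecord'):
    example = tf.train.Example()
    example.ParseFromString(record)
    writer.write(filter_small_boxes(example).SerializeToString())
writer.close()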
I am able to use legacy/{train,eval}.py with the exact same config and dataset without any problem, so this problem seems to be related to something that model_main.py is doing differently.
We used labelImg, so we used ElementTree to load the XML files. Something like:

import xml.etree.ElementTree as ET

xml_file = '/path/to/annotation.xml'
tree = ET.parse(xml_file)
root = tree.getroot()
for obj in root.findall('object'):
    for bx in obj.findall('bndbox'):
        # collect the corner coordinates (xmin, ymin, xmax, ymax) of each box
        bx_coord = {coord.tag: int(coord.text) for coord in bx.getchildren()}
This puts us at a point where we can look at (ymax - ymin) / height and set a threshold like 0.05, and likewise for xmax and xmin; we then use ElementTree to remove the object and overwrite the file (see the sketch below).
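A rough continuation of the snippet above, assuming the usual labelImg XML layout (a <size> element holding the image width and height) and the 0.05 threshold just mentioned:

height = int(root.find('size/height').text)
width = int(root.find('size/width').text)
for obj in list(root.findall('object')):
    bx = obj.find('bndbox')
    c = {coord.tag: int(coord.text) for coord in bx}
    if (c['ymax'] - c['ymin']) / float(height) < 0.05 or \
       (c['xmax'] - c['xmin']) / float(width) < 0.05:
        root.remove(obj)   # drop the too-small box
tree.write(xml_file)       # overwrite the annotation file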
For the CSV files, you might load the data into a pandas DataFrame, add columns for the differences ymax - ymin and xmax - xmin, keep only the rows where those differences are big enough, write the result to file, and rebuild the TFRecords, along the lines of the sketch below.
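A minimal sketch of that CSV route; the column names (width, height, xmin, ymin, xmax, ymax) and the 5% threshold are assumptions based on the usual labelImg-style export:

import pandas as pd

df = pd.read_csv('annotations.csv')
rel_w = (df['xmax'] - df['xmin']) / df['width']    # box width relative to image width
rel_h = (df['ymax'] - df['ymin']) / df['height']   # box height relative to image height
df[(rel_w >= 0.05) & (rel_h >= 0.05)].to_csv('annotations_filtered.csv', index=False)
# Then rebuild the TFRecords from annotations_filtered.csv as usual.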
I agree with @dnuffer.
I had the same issue; toggling augmentations and removing small boxes did not help in my case, but I am able to use legacy/{train,eval}.py with the very same configuration.
Can any TF contributor look into this? legacy/train.py is able to train, which means model_main.py is doing something unexpected.
@tenoyart Hello, could you explain why it can run once the small bounding boxes are removed? I'm very confused. Thank you.
I'm looking into this now.
Changes in my recent PR should be able to fix this issue. Please sync your fork to latest and re-run once the PR gets merged. Thanks!
@pkulzc I pulled the latest changes, but now I get a different error. Hope you can help!
https://github.com/tensorflow/tensorflow/issues/21320
The issue is not fixed yet; I still get the error "Expected size[0] in [0, 100], but got 225".
I solved the issue by enlarging the value of the max_number_of_boxes parameter, which is in input.proto; after that it runs well.
I was getting the error "Expected size[0] in [0, 100], but got 225" after pulling and was able to solve it by explicitly specifying a large value for max_number_of_boxes here:
https://github.com/tensorflow/models/blob/master/research/object_detection/inputs.py#L391
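That parameter appears to be the link to the padding failure reported earlier in this thread: groundtruth tensors are padded up to max_number_of_boxes (default 100), so an image containing more boxes than that limit yields a negative padding amount. A minimal, hypothetical illustration of just that failure, with numbers chosen to match the first traceback:

import tensorflow as tf

# Hypothetical illustration only: padding 154 boxes "up" to a maximum of 100
# needs a pad amount of 100 - 154 = -54, which tf.pad rejects at run time.
boxes = tf.placeholder(tf.float32, shape=[None, 4])
num_boxes = tf.shape(boxes)[0]
padded = tf.pad(boxes, [[0, 100 - num_boxes], [0, 0]])

with tf.Session() as sess:
    # Raises InvalidArgumentError: Paddings must be non-negative: 0 -54
    sess.run(padded, feed_dict={boxes: [[0.0, 0.0, 1.0, 1.0]] * 154})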
Thank you. I believe an official fix is due in the next release. Until then I am using legacy/train.py for my training.
I sent another PR that should fix the "Expected size[0] in [0, 100], but got 225" issue. A relevant FAQ entry has also been added to help explain this issue.
I'm hitting the same problem here; hope this can be fixed soon.
(sdk) D:\Lesson8\models\research\object_detection>python model_main.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config
Traceback (most recent call last):
File "model_main.py", line 26, in
from object_detection import model_lib
File "D:\Lesson8\models\research\object_detection\model_lib.py", line 27, in
from object_detection import eval_util
File "D:\Lesson8\models\research\object_detection\eval_util.py", line 28, in
from object_detection.metrics import coco_evaluation
File "D:\Lesson8\models\research\object_detection\metrics\coco_evaluation.py", line 20, in
from object_detection.metrics import coco_tools
File "D:\Lesson8\models\research\object_detection\metrics\coco_tools.py", line 47, in
from pycocotools import coco
File "D:\Lesson8\models\research\pycocotools\coco.py", line 55, in
from . import mask as maskUtils
File "D:\Lesson8\models\research\pycocotools\mask.py", line 3, in
import pycocotools._mask as _mask
ModuleNotFoundError: No module named 'pycocotools._mask'`