You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
I have had success using SSD MobileNet, training on custom objects: 1000+ images, all 700x700 px, 80% for training and 20% for testing (so there shouldn't be any problem with the installation). But the accuracy is poor, so I switched to faster_rcnn_nas, since accuracy is my top priority.
I have tried:
batch_queue_capacity: 2
num_batch_queue_threads: 8
prefetch_queue_capacity: 2
but it doesn't help.
The training is stuck at global_step/sec: 0. Occasionally I get a loss value (around 2.6). I trained and waited 4 hours but only managed to get 3 loss values, which is extremely slow. Other than that, all I see is global_step/sec: 0.
num_examples is 204 because I have 204 images (20%) in my test folder.
I have also tried changing the fixed_shape_resizer to 700 x 700, but that doesn't seem to make the training progress either.
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      # TODO: Only fixed_shape_resizer is currently supported for NASNet
      # featurization. The reason for this is that nasnet.py only supports
      # inputs with fully known shapes. We need to update nasnet.py to handle
      # shapes not known at compile time.
      fixed_shape_resizer {
        height: 1200
        width: 1200
      }
    }
    feature_extractor {
      type: 'faster_rcnn_nas'
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 50
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_batch_size: 49
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  batch_queue_capacity: 2
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 2
  fine_tune_checkpoint: "faster_rcnn_nas_coco_2017_11_08/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "data/train.record"
  }
  label_map_path: "data/ant-detection.pbtxt"
}
eval_config: {
  metrics_set: "pascal_voc_metrics"
  num_examples: 204
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "data/test.record"
  }
  label_map_path: "training/ant-detection.pbtxt"
  shuffle: false
  num_readers: 1
  num_epochs: 1
}
Please help :(
Is your GPU busy during this time (in which case it is possible that the hardware you're training on is just not powerful enough)?
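If you're not sure whether TensorFlow is even placing ops on the GPU, a quick check along these lines can help (a minimal sketch using TF 1.x APIs, which is what this thread is running; also keep nvidia-smi open in a second terminal while training):

# Sanity check: which devices does TensorFlow see, and where do ops land?
import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see (CPU plus any GPUs).
print(device_lib.list_local_devices())

# Log device placement for a trivial computation; ops should report
# /device:GPU:0 if the GPU build and drivers are working.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    total = tf.reduce_sum(tf.constant([1.0, 2.0, 3.0]))
    print(sess.run(total))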
This is interesting. I see that you have tried reducing the batch size; how about reducing the capacity of your model? Use fewer layers, parameters, etc. and train again. If it runs in a shorter time this way, then maybe, as @asimshankar said, your GPU is not powerful enough.
Hope this helps!
Hmm, yeah, I'm having this problem just now as well. I've tried running on CPUs both locally and on the gcloud ml-engine, with TensorFlow v1.7 (latest) as well as v1.6. Same for both Python 2.7 and Python 3.5.
My training system is not new, and has worked without this problem previously. It seems likely that this has something to do with a recent change, perhaps related to the model repository.
Although it appears to be related to the deprecated usage of training_util.get_or_create_global_step, that is a simple function forwarding and can be ignored as far as this issue is concerned.
@asimshankar My GPU is doing nothing at that time except running TensorFlow. I think you are right: my GPU is not powerful enough. I have tried faster_rcnn_inception_resnet_v2_atrous_coco; it runs, but very slowly.
@sk-g The batch_size is already at the minimum of 1. Do you mean the queue capacity? I have tried reducing that, but it doesn't work either.
@npeirson I see. Thank you for informing me about this update!
Hello friends!
This may be user error. For me, it was simply that I'd forgotten to create new TFRecord files after adding to the dataset.
Hope that helps!
I had a similar error; make sure you have created the TFRecord files correctly. In my case, there was a small bug in the create_tf_record script which created empty TFRecord files.
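If you want to rule that out quickly, counting the records in each file before training is enough to catch empty or truncated TFRecords. A minimal sketch using the TF 1.x tf_record_iterator, assuming the data/train.record and data/test.record paths from the config above:

# Make sure the .record files are not empty and contain roughly the
# number of examples you expect (e.g. 204 for the test set here).
import tensorflow as tf

for path in ["data/train.record", "data/test.record"]:
    count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print("%s: %d records" % (path, count))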
I'm under the impression that this is resolved, so closing it. Please re-open if I've misunderstood and there is more information that can be provided.
@think-data What was the error in create_tf_record? I am having this problem as well
If you're getting global_step/sec: 0 caused by TFRecord problems, verify all components of your TFRecords. Specifically, I'd batch-format all the data to a uniform setting, even if you think it's already uniform; I recommend Bulk Image Formatter (freeware). Format problems, an incorrect number of channels, etc. are common causes of this problem. Also, as was once my problem, check that your labels match your files. :)
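Along the same lines, it can help to decode a few examples and confirm the image bytes are actually there and the class labels match your label map. This is only a sketch; the feature keys below follow the usual Object Detection API convention (image/encoded, image/object/class/text) and may differ if your create_tf_record script uses other names:

# Inspect the first few examples in test.record: image payload size and labels.
import tensorflow as tf

for i, record in enumerate(tf.python_io.tf_record_iterator("data/test.record")):
    example = tf.train.Example()
    example.ParseFromString(record)
    feats = example.features.feature
    encoded = feats["image/encoded"].bytes_list.value
    labels = feats["image/object/class/text"].bytes_list.value
    print("example %d: %d image bytes, labels=%s"
          % (i, len(encoded[0]) if encoded else 0, list(labels)))
    if i >= 4:
        break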
Wonderful, I was having the same issue due to corrupted TFRecord files... finally tracked down the problem.