You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
I have had success using SSD MobileNet, training on custom objects: 1000+ images, all 700x700 px, 80% for training and 20% for testing (so there shouldn't be any problem with the installation). But the accuracy is poor, so I switched to faster_rcnn_nas, since accuracy is my top priority.
I have tried:
batch_queue_capacity: 2
num_batch_queue_threads: 8
prefetch_queue_capacity: 2
but it doesn't help.
The training is stuck at global_step/sec: 0. Occasionally I get a loss value (around 2.6). I trained and waited 4 hours but only managed to get 3 loss values, which is extremely slow. Other than that, all I see is global_step/sec: 0.
num_examples is 204 because I have 204 images (20%) in my test folder.
I have also tried changing the fixed_shape_resizer to 700 x 700, but that doesn't seem to make the training progress either.
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      # TODO: Only fixed_shape_resizer is currently supported for NASNet
      # featurization. The reason for this is that nasnet.py only supports
      # inputs with fully known shapes. We need to update nasnet.py to handle
      # shapes not known at compile time.
      fixed_shape_resizer {
        height: 1200
        width: 1200
      }
    }
    feature_extractor {
      type: 'faster_rcnn_nas'
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 50
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_batch_size: 49
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  batch_queue_capacity: 2
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 2
  fine_tune_checkpoint: "faster_rcnn_nas_coco_2017_11_08/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "data/train.record"
  }
  label_map_path: "data/ant-detection.pbtxt"
}
eval_config: {
  metrics_set: "pascal_voc_metrics"
  num_examples: 204
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "data/test.record"
  }
  label_map_path: "training/ant-detection.pbtxt"
  shuffle: false
  num_readers: 1
  num_epochs: 1
}
Please help :(
Is your GPU busy during this time (in which case it is possible that the hardware you're training on is just not powerful enough)?
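If you're not sure whether TensorFlow is even placing ops on the GPU, a quick check along these lines can help (a minimal sketch using TF 1.x APIs, which is what this thread is running; also keep nvidia-smi open in a second terminal while training):

# Sanity check: which devices does TensorFlow see, and where do ops land?
import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see (CPU plus any GPUs).
print(device_lib.list_local_devices())

# Log device placement for a trivial computation; ops should report
# /device:GPU:0 if the GPU build and drivers are working.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    total = tf.reduce_sum(tf.constant([1.0, 2.0, 3.0]))
    print(sess.run(total))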
This is interesting. I see that you have tried reducing the batch size; how about reducing the capacity of your model? Use fewer layers, parameters, etc. and train again. If it runs in a shorter time this way, then maybe, as @asimshankar said, your GPU is not powerful enough.
Hope this helps!
Hmm, yeah, I'm having this problem just now as well. I've tried running on CPUs both locally and on the gcloud ml-engine, with TensorFlow v1.7 (latest) as well as v1.6. Same for both Python 2.7 and Python 3.5.
My training system is not new, and has worked without this problem previously. It seems likely that this has something to do with a recent change, perhaps related to the model repository.
Although it appears to be related to the deprecated usage of training_util.get_or_create_global_step, that is a simple function forwarding and can be ignored as far as this issue is concerned.
@asimshankar My GPU is doing nothing at that time except running TensorFlow. I think you are right: my GPU is not powerful enough. I have tried faster_rcnn_inception_resnet_v2_atrous_coco; it runs, but very slowly.
@sk-g The batch_size is already at the minimum of 1. Do you mean the queue capacity? I have tried reducing that, but it doesn't work either.
@npeirson I see. Thank you for informing me about this update!
Hello friends!
This may be user error. For me, it was simply that I'd forgotten to create new TFRecord files after adding to the dataset.
Hope that helps!
I had a similar error; make sure you have created the TFRecord files correctly. In my case, there was a small bug in the create_tf_record script which created empty TFRecord files.
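If you want to rule that out quickly, counting the records in each file before training is enough to catch empty or truncated TFRecords. A minimal sketch using the TF 1.x tf_record_iterator, assuming the data/train.record and data/test.record paths from the config above:

# Make sure the .record files are not empty and contain roughly the
# number of examples you expect (e.g. 204 for the test set here).
import tensorflow as tf

for path in ["data/train.record", "data/test.record"]:
    count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print("%s: %d records" % (path, count))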
I'm under the impression that this is resolved, so closing it. Please re-open if I've misunderstood and there is more information that can be provided.
@think-data What was the error in create_tf_record? I am having this problem as well
If you're getting global_step/sec: 0 caused by TFRecord problems, verify all components of your TFRecords. Specifically, I'd batch-format all the data to a uniform setting, even if you think it's already uniform; I recommend Bulk Image Formatter (freeware). Format problems, an incorrect number of channels, etc. are common causes of this problem. Also, as was once my problem, check that your labels match your files. :)
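Along the same lines, it can help to decode a few examples and confirm the image bytes are actually there and the class labels match your label map. This is only a sketch; the feature keys below follow the usual Object Detection API convention (image/encoded, image/object/class/text) and may differ if your create_tf_record script uses other names:

# Inspect the first few examples in test.record: image payload size and labels.
import tensorflow as tf

for i, record in enumerate(tf.python_io.tf_record_iterator("data/test.record")):
    example = tf.train.Example()
    example.ParseFromString(record)
    feats = example.features.feature
    encoded = feats["image/encoded"].bytes_list.value
    labels = feats["image/object/class/text"].bytes_list.value
    print("example %d: %d image bytes, labels=%s"
          % (i, len(encoded[0]) if encoded else 0, list(labels)))
    if i >= 4:
        break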
Wonderful, I was having the same issue due to corrupted TFRecord files... finally tracked down the problem.