== cat /etc/issue ===============================================
Linux ubuntu 4.15.0-29-generic #31~16.04.1-Ubuntu SMP Wed Jul 18 08:54:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.5 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
== are we in docker =============================================
No
== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux ubuntu 4.15.0-29-generic #31~16.04.1-Ubuntu SMP Wed Jul 18 08:54:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy 1.14.5
protobuf 3.6.0
tensorflow 1.10.0
== check for virtualenv =========================================
True
== tensorflow import ============================================
tf.VERSION = 1.10.0
tf.GIT_VERSION = v1.10.0-0-g656e7a2b34
tf.COMPILER_VERSION = v1.10.0-0-g656e7a2b34
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
./env.sh: line 105: nvidia-smi: command not found
== cuda libs ===================================================
MODEL_DIR=/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28
PIPELINE_CONFIG_PATH=/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco.config
NUM_TRAIN_STEPS=200000
NUM_EVAL_STEPS=2000
python3 object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--num_eval_steps=${NUM_EVAL_STEPS} \
--alsologtostderr
I'm trying to train my model using the TFRecords I've created. Training never seems to begin, although the pipeline.config file is written to the model directory (its timestamp is updated). Modifying my config file to point at non-existent TFRecord files does not produce an error, which also suggests the training process never actually starts.
I have modified model_main.py to enable verbose logging: tf.logging.set_verbosity(tf.logging.DEBUG)
Running TensorBoard in parallel shows no activity either.
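To rule out the data side, here is a minimal sanity-check sketch (assuming TF 1.10, using the train.record path from my config below) that just counts the records in a TFRecord file and prints the feature keys of the first one:

import tensorflow as tf

RECORD_PATH = "/home/seth/tensorflow/BurmesePython/data/train.record"

count = 0
for record in tf.python_io.tf_record_iterator(RECORD_PATH):
    if count == 0:
        example = tf.train.Example.FromString(record)   # parse the first record
        print(sorted(example.features.feature.keys()))  # feature keys, e.g. image/encoded
    count += 1
print("total records:", count)

If this prints zero records or raises an error, the problem is in the TFRecord itself rather than in model_main.py.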
INFO:tensorflow:Maybe overwriting eval_steps: 2000
INFO:tensorflow:Maybe overwriting train_steps: 200000
INFO:tensorflow:Maybe overwriting retain_original_images_in_eval: True
INFO:tensorflow:Maybe overwriting load_pretrained: True
INFO:tensorflow:Ignoring config override key: load_pretrained
INFO:tensorflow:create_estimator_and_inputs: use_tpu False
INFO:tensorflow:Using config: {'_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3c53771ba8>, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tf_random_seed': None, '_task_type': 'worker', '_global_id_in_cluster': 0, '_log_step_count_steps': 100, '_save_summary_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_save_checkpoints_secs': 600, '_device_fn': None, '_master': '', '_train_distribute': None, '_is_chief': True, '_task_id': 0, '_model_dir': '/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28', '_session_config': None, '_service': None}
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7f3c53712b70>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Writing pipeline config file to /home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/pipeline.config
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Skipping training since max_steps has already saved.
What are you trying to train? Specifically, which checkpoint are you loading, and what is your max_number_of_steps in the config? I am guessing that if you are fine-tuning from a saved checkpoint, the global step may be getting loaded into the graph, so the train script thinks that training has already completed even though none of your training iterations have even begun.
@MichaelX99 -- Thank you for your input on this. In the tutorials and examples I've seen, I don't find any information about how the checkpoint is supposed to be specified. I intentionally set the checkpoint reference in the config file (fine_tune_checkpoint) to a path that doesn't exist on the filesystem; that did not trigger an error, nor was the file created.
Attached is a copy of the config file, and below is a directory listing of the model directory with the original checkpoint files that came with the model.
My goal here is to train a model that can detect Burmese pythons in the field.
/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28# ls
checkpoint model.ckpt.index saved_model
frozen_inference_graph.pb model.ckpt.meta
model.ckpt.data-00000-of-00001 pipeline.config
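To check @MichaelX99's hypothesis about the global step, here is a minimal sketch (assuming a TF 1.x checkpoint; the prefix is the model.ckpt from the listing above) that reads the checkpoint directly and reports whether it stores a global_step and what its value is:

import tensorflow as tf

CKPT_PREFIX = "/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/model.ckpt"

reader = tf.train.NewCheckpointReader(CKPT_PREFIX)
if reader.has_tensor("global_step"):
    # If this value is >= num_train_steps, the Estimator considers training finished.
    print("global_step in checkpoint:", reader.get_tensor("global_step"))
else:
    print("checkpoint has no global_step variable")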
# SSD with Inception v2 configuration for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
        reduce_boxes_in_lowest_layer: true
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_inception_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}
train_config: {
  batch_size: 5
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/xxmodel.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/seth/tensorflow/BurmesePython/data/train.record"
  }
  label_map_path: "/home/seth/tensorflow/BurmesePython/data/python_label_map.pbtxt"
}
eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/seth/tensorflow/BurmesePython/data/test.record"
  }
  label_map_path: "/home/seth/tensorflow/BurmesePython/data/python_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
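Since a bogus fine_tune_checkpoint or input_path is apparently accepted silently, here is a quick plain-Python sketch (paths copied from my config above) to confirm the referenced files actually exist before launching training:

import os

# fine_tune_checkpoint is a checkpoint prefix, so the .index file is what
# should exist on disk for a valid checkpoint.
paths = [
    "/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/xxmodel.ckpt.index",
    "/home/seth/tensorflow/BurmesePython/data/train.record",
    "/home/seth/tensorflow/BurmesePython/data/test.record",
    "/home/seth/tensorflow/BurmesePython/data/python_label_map.pbtxt",
]

for p in paths:
    print("OK  " if os.path.exists(p) else "MISS", p)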
Should I take this to Stack Overflow? It does seem like a design flaw in this script, since there is no output at the end of the run explaining why it completed without doing anything. A default summary of what was done would be useful.
I am seeing the same issue with model_main.py, but with a different model. It seems to just hang there not doing anything. It issues a few matplotlib warnings, but that's about it.
What are you trying to train? Specifically, which checkpoint are you loading, and what is your max_number_of_steps in the config? I am guessing that if you are fine-tuning from a saved checkpoint, the global step may be getting loaded into the graph, so the train script thinks that training has already completed even though none of your training iterations have even begun.
Thanks very much, that's exactly how it is.
MODEL_DIR is the output directory; it should not be set to the same directory as the pretrained checkpoint, or the checkpoint files there get overwritten.
I'm facing the exact same issue with model_main.py. I have renamed my training folders and dropped fresh checkpoints from the Model Zoo zip file, but I still see the same behaviour.
Here's my output:
python E:\\Documents\\Projects\\tensorflow\\models\\research\\object_detection\\model_main.py --alsologtostderr --pipeline_config_path="./_training/ssdlite_mobilenet_v2_coco.config" --model_dir="./_training" --num_train_steps=50100 --NUM_EVAL_STEPS=2000
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
W1125 11:19:50.079212 12616 tf_logging.py:125] Forced number of epochs for all eval validations to be 1.
INFO:tensorflow:Maybe overwriting train_steps: 50100
I1125 11:19:50.081207 12616 tf_logging.py:115] Maybe overwriting train_steps: 50100
INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: 1
I1125 11:19:50.082204 12616 tf_logging.py:115] Maybe overwriting sample_1_of_n_eval_examples: 1
INFO:tensorflow:Maybe overwriting eval_num_epochs: 1
I1125 11:19:50.083202 12616 tf_logging.py:115] Maybe overwriting eval_num_epochs: 1
INFO:tensorflow:Maybe overwriting load_pretrained: True
I1125 11:19:50.084199 12616 tf_logging.py:115] Maybe overwriting load_pretrained: True
INFO:tensorflow:Ignoring config override key: load_pretrained
I1125 11:19:50.084199 12616 tf_logging.py:115] Ignoring config override key: load_pretrained
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
W1125 11:19:50.085196 12616 tf_logging.py:125] Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
INFO:tensorflow:create_estimator_and_inputs: use_tpu False, export_to_tpu False
I1125 11:19:50.086194 12616 tf_logging.py:115] create_estimator_and_inputs: use_tpu False, export_to_tpu False
INFO:tensorflow:Using config: {'_model_dir': './_training', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000241380956A0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
I1125 11:19:50.087191 12616 tf_logging.py:115] Using config: {'_model_dir': './_training', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000241380956A0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x0000024138093D08>) includes params argument, but params are not passed to Estimator.
W1125 11:19:50.088188 12616 tf_logging.py:125] Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x0000024138093D08>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Not using Distribute Coordinator.
I1125 11:19:50.088188 12616 tf_logging.py:115] Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
I1125 11:19:50.095170 12616 tf_logging.py:115] Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
I1125 11:19:50.096168 12616 tf_logging.py:115] Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Skipping training since max_steps has already saved.
I1125 11:19:50.105143 12616 tf_logging.py:115] Skipping training since max_steps has already saved.
Oh dear, I found the solution to my issue here (the second reply): basically, remove the existing checkpoint file from the model directory. I renamed it instead, and voilà, my training has begun and it is also saving checkpoint files now.
I also noticed that num_train_steps and num_eval_steps in the command below are ignored:
python E:\\Documents\\Projects\\tensorflow\\models2\\research\\object_detection\\model_main.py --alsologtostderr --pipeline_config_path="./training/ssdlite_mobilenet_v2_coco.config" --model_dir="./training" --num_train_steps=300 --NUM_EVAL_STEPS=200
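For context on why removing or renaming the checkpoint file helps, here is a hedged sketch of what I believe is going on: the Estimator asks for the latest checkpoint in --model_dir, resolves it through that checkpoint state file, and if the resolved checkpoint already carries a global step at or beyond num_train_steps it skips training. You can see what gets resolved with:

import tensorflow as tf

MODEL_DIR = "./training"  # the --model_dir used in the command above

# Reads the `checkpoint` state file in MODEL_DIR; returns None once that file
# has been removed or renamed, in which case training starts from scratch.
print(tf.train.latest_checkpoint(MODEL_DIR))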
I was also able to follow @zubairahmed-ai's advice and got the model to train. However, the following arguments are not very clear to me:
num_train_steps
sample_1_of_n_eval_examples
The reason I am asking, especially about the first one, is that in my config file I have specified the num_steps parameter in train_config. Should I then remove it from the config file?
@sayakpaul Good to see someone from PyImageSearch.com here 🙂
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
Could never get this working. Ended up having to abandon the project.
I simply deleted the "checkpoint" file inside the model folder "models/research/object_detection/ssd_mobilenet_v2_coco_2018_03_29" and training started.