== cat /etc/issue ===============================================
Linux ubuntu 4.15.0-29-generic #31~16.04.1-Ubuntu SMP Wed Jul 18 08:54:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.5 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
== are we in docker =============================================
No
== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux ubuntu 4.15.0-29-generic #31~16.04.1-Ubuntu SMP Wed Jul 18 08:54:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy 1.14.5
protobuf 3.6.0
tensorflow 1.10.0
== check for virtualenv =========================================
True
== tensorflow import ============================================
tf.VERSION = 1.10.0
tf.GIT_VERSION = v1.10.0-0-g656e7a2b34
tf.COMPILER_VERSION = v1.10.0-0-g656e7a2b34
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
./env.sh: line 105: nvidia-smi: command not found
== cuda libs ===================================================
MODEL_DIR=/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28
PIPELINE_CONFIG_PATH=/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco.config
NUM_TRAIN_STEPS=200000
NUM_EVAL_STEPS=2000
python3 object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--num_eval_steps=${NUM_EVAL_STEPS} \
--alsologtostderr
I'm trying to train my model using the TFRecords I've created. Training never seems to begin, although the pipeline.config file is written to the model directory (its timestamp is updated). Modifying my config file to point at non-existent TFRecord files does not produce an error, which also suggests the training process never actually starts.
I have modified model_main.py to enable verbose logging: tf.logging.set_verbosity(tf.logging.DEBUG)
Running TensorBoard in parallel shows no activity either.
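To rule out the data side, here is a minimal sanity-check sketch (assuming TF 1.10, using the train.record path from my config below) that just counts the records in a TFRecord file and prints the feature keys of the first one:

import tensorflow as tf

RECORD_PATH = "/home/seth/tensorflow/BurmesePython/data/train.record"

count = 0
for record in tf.python_io.tf_record_iterator(RECORD_PATH):
    if count == 0:
        example = tf.train.Example.FromString(record)   # parse the first record
        print(sorted(example.features.feature.keys()))  # feature keys, e.g. image/encoded
    count += 1
print("total records:", count)

If this prints zero records or raises an error, the problem is in the TFRecord itself rather than in model_main.py.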
INFO:tensorflow:Maybe overwriting eval_steps: 2000
INFO:tensorflow:Maybe overwriting train_steps: 200000
INFO:tensorflow:Maybe overwriting retain_original_images_in_eval: True
INFO:tensorflow:Maybe overwriting load_pretrained: True
INFO:tensorflow:Ignoring config override key: load_pretrained
INFO:tensorflow:create_estimator_and_inputs: use_tpu False
INFO:tensorflow:Using config: {'_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3c53771ba8>, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tf_random_seed': None, '_task_type': 'worker', '_global_id_in_cluster': 0, '_log_step_count_steps': 100, '_save_summary_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_save_checkpoints_secs': 600, '_device_fn': None, '_master': '', '_train_distribute': None, '_is_chief': True, '_task_id': 0, '_model_dir': '/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28', '_session_config': None, '_service': None}
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7f3c53712b70>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Writing pipeline config file to /home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/pipeline.config
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Skipping training since max_steps has already saved.
What are you trying to train? Specifically, which checkpoint are you loading, and what is your max_number_of_steps in the config? I am guessing that if you are fine-tuning from a saved checkpoint, the global step may be getting loaded into the graph, so the train script thinks that training has already completed even though none of your training iterations have even begun.
@MichaelX99 -- Thank you for your input on this. In the tutorials and examples I've seen, I don't find any information about how the checkpoint is supposed to be specified. I intentionally set the checkpoint reference in the config file (fine_tune_checkpoint) to a path that doesn't exist on the filesystem; that did not trigger an error, nor was the file created.
Attached is a copy of the config file, and below is a directory listing of the model directory with the original checkpoint files that came with the model.
My goal here is to train a model that can detect Burmese pythons in the field.
/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28# ls
checkpoint model.ckpt.index saved_model
frozen_inference_graph.pb model.ckpt.meta
model.ckpt.data-00000-of-00001 pipeline.config
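To check @MichaelX99's hypothesis about the global step, here is a minimal sketch (assuming a TF 1.x checkpoint; the prefix is the model.ckpt from the listing above) that reads the checkpoint directly and reports whether it stores a global_step and what its value is:

import tensorflow as tf

CKPT_PREFIX = "/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/model.ckpt"

reader = tf.train.NewCheckpointReader(CKPT_PREFIX)
if reader.has_tensor("global_step"):
    # If this value is >= num_train_steps, the Estimator considers training finished.
    print("global_step in checkpoint:", reader.get_tensor("global_step"))
else:
    print("checkpoint has no global_step variable")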
# SSD with Inception v2 configuration for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
        reduce_boxes_in_lowest_layer: true
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_inception_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}
train_config: {
  batch_size: 5
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/xxmodel.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/seth/tensorflow/BurmesePython/data/train.record"
  }
  label_map_path: "/home/seth/tensorflow/BurmesePython/data/python_label_map.pbtxt"
}
eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/seth/tensorflow/BurmesePython/data/test.record"
  }
  label_map_path: "/home/seth/tensorflow/BurmesePython/data/python_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
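Since a bogus fine_tune_checkpoint or input_path is apparently accepted silently, here is a quick plain-Python sketch (paths copied from my config above) to confirm the referenced files actually exist before launching training:

import os

# fine_tune_checkpoint is a checkpoint prefix, so the .index file is what
# should exist on disk for a valid checkpoint.
paths = [
    "/home/seth/tensorflow/BurmesePython/ssd_inception_v2_coco_2018_01_28/xxmodel.ckpt.index",
    "/home/seth/tensorflow/BurmesePython/data/train.record",
    "/home/seth/tensorflow/BurmesePython/data/test.record",
    "/home/seth/tensorflow/BurmesePython/data/python_label_map.pbtxt",
]

for p in paths:
    print("OK  " if os.path.exists(p) else "MISS", p)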
Should I take this to Stack Overflow? It does seem like a design flaw in this script, since there is no output at the end of the run explaining why it completed without doing anything. A default summary of what was done would be useful.
I am seeing the same issue with model_main.py, but with a different model. It seems to just hang there not doing anything. It issues a few matplotlib warnings, but that's about it.
What are you trying to train? Specifically, which checkpoint are you loading, and what is your max_number_of_steps in the config? I am guessing that if you are fine-tuning from a saved checkpoint, the global step may be getting loaded into the graph, so the train script thinks that training has already completed even though none of your training iterations have even begun.
Thanks very much, that's exactly how it is.
MODEL_DIR is the output directory; it should not be set to the same directory as the pretrained checkpoint, or the checkpoint files there get overwritten.
I'm facing the exact same issue with model_main.py. I have renamed my training folders and dropped fresh checkpoints from the Model Zoo zip file, but I still see the same behaviour.
Here's my output:
python E:\\Documents\\Projects\\tensorflow\\models\\research\\object_detection\\model_main.py --alsologtostderr --pipeline_config_path="./_training/ssdlite_mobilenet_v2_coco.config" --model_dir="./_training" --num_train_steps=50100 --NUM_EVAL_STEPS=2000
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
W1125 11:19:50.079212 12616 tf_logging.py:125] Forced number of epochs for all eval validations to be 1.
INFO:tensorflow:Maybe overwriting train_steps: 50100
I1125 11:19:50.081207 12616 tf_logging.py:115] Maybe overwriting train_steps: 50100
INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: 1
I1125 11:19:50.082204 12616 tf_logging.py:115] Maybe overwriting sample_1_of_n_eval_examples: 1
INFO:tensorflow:Maybe overwriting eval_num_epochs: 1
I1125 11:19:50.083202 12616 tf_logging.py:115] Maybe overwriting eval_num_epochs: 1
INFO:tensorflow:Maybe overwriting load_pretrained: True
I1125 11:19:50.084199 12616 tf_logging.py:115] Maybe overwriting load_pretrained: True
INFO:tensorflow:Ignoring config override key: load_pretrained
I1125 11:19:50.084199 12616 tf_logging.py:115] Ignoring config override key: load_pretrained
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
W1125 11:19:50.085196 12616 tf_logging.py:125] Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
INFO:tensorflow:create_estimator_and_inputs: use_tpu False, export_to_tpu False
I1125 11:19:50.086194 12616 tf_logging.py:115] create_estimator_and_inputs: use_tpu False, export_to_tpu False
INFO:tensorflow:Using config: {'_model_dir': './_training', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000241380956A0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
I1125 11:19:50.087191 12616 tf_logging.py:115] Using config: {'_model_dir': './_training', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000241380956A0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x0000024138093D08>) includes params argument, but params are not passed to Estimator.
W1125 11:19:50.088188 12616 tf_logging.py:125] Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x0000024138093D08>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Not using Distribute Coordinator.
I1125 11:19:50.088188 12616 tf_logging.py:115] Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
I1125 11:19:50.095170 12616 tf_logging.py:115] Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
I1125 11:19:50.096168 12616 tf_logging.py:115] Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Skipping training since max_steps has already saved.
I1125 11:19:50.105143 12616 tf_logging.py:115] Skipping training since max_steps has already saved.
Oh dear, I found the solution to my issue here (the second reply): basically, remove the existing checkpoint file from the model directory. I renamed it instead, and voilà, my training has begun and it is also saving checkpoint files now.
I also noticed that num_train_steps and num_eval_steps in the command below are ignored:
python E:\\Documents\\Projects\\tensorflow\\models2\\research\\object_detection\\model_main.py --alsologtostderr --pipeline_config_path="./training/ssdlite_mobilenet_v2_coco.config" --model_dir="./training" --num_train_steps=300 --NUM_EVAL_STEPS=200
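For context on why removing or renaming the checkpoint file helps, here is a hedged sketch of what I believe is going on: the Estimator asks for the latest checkpoint in --model_dir, resolves it through that checkpoint state file, and if the resolved checkpoint already carries a global step at or beyond num_train_steps it skips training. You can see what gets resolved with:

import tensorflow as tf

MODEL_DIR = "./training"  # the --model_dir used in the command above

# Reads the `checkpoint` state file in MODEL_DIR; returns None once that file
# has been removed or renamed, in which case training starts from scratch.
print(tf.train.latest_checkpoint(MODEL_DIR))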
I was also able to follow @zubairahmed-ai's advice and got the model to train. However, the following arguments are not very clear to me:
num_train_steps
sample_1_of_n_eval_examples
The reason I am asking, especially about the first one, is that in my config file I have specified the num_steps parameter in train_config. Should I then remove it from the config file?
@sayakpaul Good to see someone from PyImageSearch.com here 🙂
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
Could never get this working. Ended up having to abandon the project.
I simply deleted the "checkpoint" file inside the model folder "models/research/object_detection/ssd_mobilenet_v2_coco_2018_03_29" and training started.