Models: TF2 evaluation is not run automatically during training

Created on 25 Oct 2020 · 4 comments · Source: tensorflow/models

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [x] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection

2. Describe the bug

Evaluation is not performed automatically every 300 seconds during training.

3. Steps to reproduce

I'm using TF2 with a GPU and running:

python model_main_tf2.py --model_dir=path\to\model\dir --pipeline_config_path=path\to\pipeline.config

4. Expected behavior

COCO evaluation should run periodically during training, the same way it did when I used the TF1 Object Detection API.
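
For context, the TF1 entry point interleaved evaluation with training via tf.estimator.train_and_evaluate, which is why evaluation used to run periodically (every 300 seconds in the setup described above) without a separate job. Below is a minimal, self-contained sketch of that pattern; it uses a dummy linear model rather than the actual detection code, and the model_dir and hyperparameters are illustrative assumptions:

# Sketch of the TF1-era behavior: tf.estimator.train_and_evaluate
# interleaves evaluation with training, throttled by throttle_secs.
# Dummy linear model for illustration only; not the detection model.
import tensorflow.compat.v1 as tf

def model_fn(features, labels, mode):
    # Single trainable weight fitted to labels = 2 * features.
    w = tf.get_variable("w", [], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(w * features - labels))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Infinite stream of (feature, label) pairs, batched.
    return tf.data.Dataset.from_tensors((1.0, 2.0)).repeat().batch(8)

estimator = tf.estimator.Estimator(model_fn, model_dir="estimator_demo")
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn, max_steps=1000),
    # throttle_secs=300 mirrors the "evaluate every 300 s" behavior.
    tf.estimator.EvalSpec(input_fn, steps=10, throttle_secs=300))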

5. Additional context

I'm using tensorflow-gpu 2.3.0.
The commit I'm using: a26d77c47b319c367c2a81098eee72d9373cdc91

My pipeline.config:

model {
  ssd {
    num_classes: 3
    image_resizer {
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2_fpn_keras"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          random_normal_initializer {
            mean: 0.0
            stddev: 0.009999999776482582
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.996999979019165
          scale: true
          epsilon: 0.0010000000474974513
        }
      }
      use_depthwise: true
      override_base_feature_extractor_hyperparams: true
      fpn {
        min_level: 3
        max_level: 7
        additional_layer_depth: 128
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.996999979019165
            scale: true
            epsilon: 0.0010000000474974513
          }
        }
        depth: 128
        num_layers_before_predictor: 4
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        share_prediction_tower: true
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 2
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: false
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 15
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.07999999821186066
          total_steps: 50000
          warmup_learning_rate: 0.026666000485420227
          warmup_steps: 1000
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "C:\ObjectDetection\FaceMaskDetection\Zoo\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8\checkpoint\ckpt-0"
  num_steps: 20000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "C:/ObjectDetection/FaceMaskDetection/Dataset/TFRecord/label_map.txt"
  tf_record_input_reader {
    input_path: "C:/ObjectDetection/FaceMaskDetection/Dataset/TFRecord/train.record"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "C:/ObjectDetection/FaceMaskDetection/Dataset/TFRecord/label_map.txt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "C:/ObjectDetection/FaceMaskDetection/Dataset/TFRecord/eval.record"
  }
}

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Mobile device name if the issue happens on a mobile device: NO
  • TensorFlow installed from (source or binary): using pip
  • TensorFlow version (use command below): tensorflow-gpu 2.3.0
  • Python version: Python 3.6.12
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: cuda_10.1 cudnn64_7
  • GPU model and memory: NVIDIA GeForce RTX 2070 SUPER
Labels: research, bug

Most helpful comment

Ok, I solved the problem.

If you look closely at model_main_tf2.py, you will see that it runs evaluation only when you specify FLAGS.checkpoint_dir; when you don't, it runs the training loop instead. You can't run both in the same process with the current implementation.

  if FLAGS.checkpoint_dir:
    model_lib_v2.eval_continuously(...)  # <-- Evaluation
  else:
    if FLAGS.use_tpu:
      resolver = tf.distribute.cluster_resolver.TPUClusterResolver(FLAGS.tpu_name)
      tf.config.experimental_connect_to_cluster(resolver)
      tf.tpu.experimental.initialize_tpu_system(resolver)
      strategy = tf.distribute.experimental.TPUStrategy(resolver)
    elif FLAGS.num_workers > 1:
      strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    else:
      strategy = tf.compat.v2.distribute.MirroredStrategy()

    with strategy.scope():
      model_lib_v2.train_loop(...)  # <-- Training

The workaround I found is to run the evaluation in a parallel command prompt, by running: python object_detection/model_main_tf2.py --checkpoint_dir <same path as model_dir> --model_dir <the model_dir you passed in the training process> --pipeline_config_path <path to the pipeline.config file you're training with>

Make sure to hide the GPU from the evaluation script with set CUDA_VISIBLE_DEVICES=-1 in that command prompt; otherwise it fails on GPU memory allocation. This way it works like a champ!! 👊

All 4 comments

Can confirm. I'm having the same issue after switching from TF1 to TF2.

Ok, I solved the problem.

If you look closely at model_main_tf2.py, you will see that it runs evaluation only when you specify FLAGS.checkpoint_dir; when you don't, it runs the training loop instead. You can't run both in the same process with the current implementation.

  if FLAGS.checkpoint_dir:
    model_lib_v2.eval_continuously(...)  # <-- Evaluation
  else:
    if FLAGS.use_tpu:
      resolver = tf.distribute.cluster_resolver.TPUClusterResolver(FLAGS.tpu_name)
      tf.config.experimental_connect_to_cluster(resolver)
      tf.tpu.experimental.initialize_tpu_system(resolver)
      strategy = tf.distribute.experimental.TPUStrategy(resolver)
    elif FLAGS.num_workers > 1:
      strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    else:
      strategy = tf.compat.v2.distribute.MirroredStrategy()

    with strategy.scope():
      model_lib_v2.train_loop(...)  # <-- Training

The workaround I found is to run the evaluation in a parallel command prompt, by running: python object_detection/model_main_tf2.py --checkpoint_dir <same path as model_dir> --model_dir <the model_dir you passed in the training process> --pipeline_config_path <path to the pipeline.config file you're training with>

Make sure to hide the GPU from the evaluation script with set CUDA_VISIBLE_DEVICES=-1 in that command prompt; otherwise it fails on GPU memory allocation. This way it works like a champ!! 👊
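
To avoid juggling two prompts by hand, a small launcher script can start both processes. A minimal sketch of the two-process workaround described above; the two path constants are placeholder assumptions, not values from this issue, and the flags used are the ones model_main_tf2.py already exposes:

# Launcher sketch for the two-process workaround: training gets the GPU,
# evaluation runs in parallel with the GPU hidden.
import os
import subprocess

MODEL_DIR = r"path\to\model\dir"          # placeholder; use your model_dir
PIPELINE_CONFIG = r"path\to\pipeline.config"  # placeholder; use your config

# Training process: keeps the GPU.
train = subprocess.Popen([
    "python", "object_detection/model_main_tf2.py",
    "--model_dir", MODEL_DIR,
    "--pipeline_config_path", PIPELINE_CONFIG,
])

# Evaluation process: hide the GPU (same effect as
# `set CUDA_VISIBLE_DEVICES=-1`) so it doesn't fight the trainer for memory.
eval_env = dict(os.environ, CUDA_VISIBLE_DEVICES="-1")
evaluate = subprocess.Popen([
    "python", "object_detection/model_main_tf2.py",
    "--model_dir", MODEL_DIR,
    "--checkpoint_dir", MODEL_DIR,  # same path as model_dir
    "--pipeline_config_path", PIPELINE_CONFIG,
], env=eval_env)

train.wait()           # block until training finishes
evaluate.terminate()   # then stop the continuous eval loop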


Hi @horczech,
I get two log directories, /train and /eval.
In TensorBoard I can see the training loss under /train and the eval mAP under /eval, but how can I see the eval loss?
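
If the /train and /eval directories both live under the same model_dir (as in the workaround above), one TensorBoard instance pointed at the parent directory shows both runs side by side; a minimal example, assuming a standard TensorBoard install and the placeholder path from earlier:

tensorboard --logdir path\to\model\dir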

