Models: Eval issues only 1 image in TensorBoard

Created on 11 Aug 2018 · 25 comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: object detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 Linux
  • TensorFlow installed from (source or binary): pip binary 1.10
  • TensorFlow version (use command below): v1.10.0-0-g656e7a2b34 1.10.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: P100 on Google Cloud, 16 GB RAM
  • Exact command to reproduce:

NUM_TRAIN_STEPS=50000
NUM_EVAL_STEPS=2000
python ./object_detection/model_main.py \
--pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
--model_dir=${PATH_TO_TRAIN_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--num_eval_steps=${NUM_EVAL_STEPS} \
--alsologtostderr

Describe the problem

Evaluation only shows 1 image in TensorBoard; see this screenshot:
https://imgur.com/a/ZgUoaFS

I have tried changing the pipeline config variables, but nothing seems to matter.
I tried max_evals, num_examples, visualization_export_dir, and num_visualizations, as per:
https://github.com/tensorflow/models/blob/master/research/object_detection/protos/eval.proto

Here is the pipeline.config which is written to the training dir by TF:

model {
  ssd {
    num_classes: 8
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.9997000098228455
          center: true
          scale: true
          epsilon: 0.0010000000474974513
          train: true
        }
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            truncated_normal_initializer {
              mean: 0.0
              stddev: 0.029999999329447746
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.9997000098228455
            center: true
            scale: true
            epsilon: 0.0010000000474974513
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011920929
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298023224
        max_scale: 0.949999988079071
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.33329999446868896
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.9900000095367432
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
  }
}
train_config {
  batch_size: 32
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004000000189989805
          decay_steps: 800720
          decay_factor: 0.949999988079071
        }
      }
      momentum_optimizer_value: 0.8999999761581421
      decay: 0.8999999761581421
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "/home/example/models/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
  num_steps: 50000
  fine_tune_checkpoint_type: "detection"
}
train_input_reader {
  label_map_path: "/home/example/data/training/tfrecord/2018-08-11/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/home/example/data/training/tfrecord/2018-08-11/train.record"
  }
}
eval_config {
  num_examples: 2000
  max_evals: 10
  visualization_export_dir: "/home/example/models/2018-08-11/training/eval_images"
  metrics_set: "coco_detection_metrics"
  retain_original_images: true
}
eval_input_reader {
  label_map_path: "/home/example/data/training/tfrecord/2018-08-11/label_map.pbtxt"
  shuffle: true
  num_readers: 1
  tf_record_input_reader {
    input_path: "/home/example/data/training/tfrecord/2018-08-11/test.record"
  }
}

I have looked at the following and tried making the suggested change, but it makes no difference:
https://stackoverflow.com/questions/51636600/tensorflow-1-9-object-detection-model-main-py-only-evaluates-one-image

Source code / logs

See above

Most helpful comment

This has been fixed and will go out in next release.

All 25 comments

Same problem here.

I think the problem is related to the fact that evaluation is done one batch at a time, and the visualization does not properly keep "state" between batches, which is why they chose to start simple with only one image. I tried to check whether it was related to the summary always being overwritten, by adding a random suffix to the summary and eval_metrics names (dictionary keys), but without success...

It's annoying because it worked in previous versions. To avoid issues with GPU memory I ran the eval.py script with CUDA_VISIBLE_DEVICES=-1, which then ran it on the CPU independently of training. Also, changing NUM_EVAL_STEPS doesn't seem to have the expected effect of increasing or decreasing how often evaluation is run.
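A sketch of that CPU-only invocation, using the flags of the legacy eval.py script quoted later in this thread (paths are placeholders):

CUDA_VISIBLE_DEVICES=-1 python object_detection/legacy/eval.py \
--logtostderr \
--pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
--checkpoint_dir=${PATH_TO_TRAIN_DIR} \
--eval_dir=${PATH_TO_EVAL_DIR}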

This has been fixed and will go out in next release.

@pkulzc Thanks. When will the next release be pushed?

@pkulzc awesome, you're a legend!!! 👍

Thanks a lot. When will this change be pushed? I will need this function a lot. ^^

Thank you! I hope it will be released really soon; without it, evaluations are useless.

@pkulzc Can you please link to the commit that fixes this? It would be highly appreciated as I couldn't find it. Thanks!

@pkulzc it would be nice to have a workaround until the new release. Thanks.

I tried to find an easy workaround but I couldn't. Any idea when the update will be released? Thanks.

Running into the same issue... it worked fine a couple of months ago. @pkulzc any ETA or workaround?

@ernstgoyer @ldalzovo @aysark If you are only interested in displaying multiple test images with inferred bounding boxes (and don't need the side-by-side comparison with the ground truth) then you can still use the legacy eval method. I have tested this and it works.

python object_detection/legacy/eval.py --logtostderr \ 
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<output directory for eval files to be read by tensorboard>

@pkulzc This has been a while. When will the next release be out? I wonder if you could do a bug-fix release instead of a full release, if the latter is difficult.

@pkulzc Hope it will come soon.

Pull request is under review now.

@pkulzc Hi, any update on the PR ?

@harshini-gadige PR has already been merged into master and the issue is resolved

Hi, I still have the same issue using Google ML Engine, with runtime 1.10 or 1.9. I tried to use 1.11 and got the error: "INVALID_ARGUMENT: Field: runtime_version Error: The specified runtime version '1.11' with the Python version '' is not supported or is deprecated. Please specify a different runtime version. See https://cloud.google.com/ml/docs/concepts/runtime-version-list for a list of supported versions"

I have a stupid question: is the above problem only a display issue (i.e. TensorBoard only displays 1 evaluation image), or is it really a problem with the evaluation itself (i.e. instead of evaluating all the images in the evaluation folder, the program only evaluates 1 image)?
Thanks for your answer!

Hi, I am facing the same issue, i.e. instead of evaluating all the images in the evaluation folder, the program only evaluates 1 image. Has anybody fixed this?
Thanks

If you want to have more visualizations, try setting this field.

If you want to control the fraction of the data evaluated by the eval job, try setting this field.

Note that the first field lives in eval_config, while the second one lives in the input reader.

Fixed after updating the config files.
It should be the num_visualizations parameter in your eval_config; this parameter controls how many (randomly sampled) evaluation images appear in TensorBoard.
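For illustration, a minimal eval_config sketch with num_visualizations set (the values are illustrative, not recommendations). The sample_1_of_n_examples field shown for the input reader is my guess at the "fraction of data" knob mentioned two comments above and is not confirmed anywhere in this thread:

eval_config {
  num_examples: 2000
  num_visualizations: 20  # number of eval images exported to TensorBoard
  metrics_set: "coco_detection_metrics"
}
eval_input_reader {
  # label_map_path / tf_record_input_reader as in the config above
  sample_1_of_n_examples: 5  # assumed field name; sample every 5th eval example
}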

@ernstgoyer @ldalzovo @aysark If you are only interested in displaying multiple test images with inferred bounding boxes (and don't need the side-by-side comparison with the ground truth) then you can still use the legacy eval method. I have tested this and it works.

python object_detection/legacy/eval.py --logtostderr \ 
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<output directory for eval files to be read by tensorboard>

What should I write for eval_dir?
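For illustration only: per the description a few comments above, eval_dir is just the output directory for the eval event files, so any writable directory works, and it is the same directory you point TensorBoard's --logdir at. A hypothetical invocation (paths are placeholders):

python object_detection/legacy/eval.py --logtostderr \
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<any writable directory, e.g. ./eval_out>
tensorboard --logdir=<same directory, e.g. ./eval_out>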

FYI, this is not mentioned in the tutorial
(https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#evaluating-the-model-optional).
It would probably be a useful thing to have there, as it makes it easy to find out early on whether you have a problem.
