Models: Eval issues only 1 image in TensorBoard

Created on 11 Aug 2018 · 25 comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: object detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 Linux
  • TensorFlow installed from (source or binary): pip binary 1.10
  • TensorFlow version (use command below): v1.10.0-0-g656e7a2b34 1.10.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: P100 on Google Cloud, 16 GB RAM
  • Exact command to reproduce:

NUM_TRAIN_STEPS=50000
NUM_EVAL_STEPS=2000
python ./object_detection/model_main.py \
--pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
--model_dir=${PATH_TO_TRAIN_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--num_eval_steps=${NUM_EVAL_STEPS} \
--alsologtostderr

Describe the problem

Evaluation only shows 1 image in TensorBoard; see this screenshot:
https://imgur.com/a/ZgUoaFS

I have tried changing the pipeline config variables, but nothing seems to matter.
I tried max_evals, num_examples, visualization_export_dir, and num_visualizations, as per:
https://github.com/tensorflow/models/blob/master/research/object_detection/protos/eval.proto

Here is the pipeline.config which is written to the training dir by TF:

model {
  ssd {
    num_classes: 8
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.9997000098228455
          center: true
          scale: true
          epsilon: 0.0010000000474974513
          train: true
        }
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            truncated_normal_initializer {
              mean: 0.0
              stddev: 0.029999999329447746
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.9997000098228455
            center: true
            scale: true
            epsilon: 0.0010000000474974513
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011920929
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298023224
        max_scale: 0.949999988079071
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.33329999446868896
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.9900000095367432
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
  }
}
train_config {
  batch_size: 32
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004000000189989805
          decay_steps: 800720
          decay_factor: 0.949999988079071
        }
      }
      momentum_optimizer_value: 0.8999999761581421
      decay: 0.8999999761581421
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "/home/example/models/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
  num_steps: 50000
  fine_tune_checkpoint_type: "detection"
}
train_input_reader {
  label_map_path: "/home/example/data/training/tfrecord/2018-08-11/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/home/example/data/training/tfrecord/2018-08-11/train.record"
  }
}
eval_config {
  num_examples: 2000
  max_evals: 10
  visualization_export_dir: "/home/example/models/2018-08-11/training/eval_images"
  metrics_set: "coco_detection_metrics"
  retain_original_images: true
}
eval_input_reader {
  label_map_path: "/home/example/data/training/tfrecord/2018-08-11/label_map.pbtxt"
  shuffle: true
  num_readers: 1
  tf_record_input_reader {
    input_path: "/home/example/data/training/tfrecord/2018-08-11/test.record"
  }
}

I have looked at the following and tried making the suggested change, but it makes no difference:
https://stackoverflow.com/questions/51636600/tensorflow-1-9-object-detection-model-main-py-only-evaluates-one-image

Source code / logs

See above

Most helpful comment

This has been fixed and will go out in next release.

All 25 comments

Same problem here.

I think the problem is related to the fact that evaluation is done one batch at a time, and the visualization does not properly keep "state" between batches, which is why they chose to start simple with only one image. I tried to check whether it was related to the summary always being overwritten, by adding a random suffix to the summary and eval_metrics names (dictionary keys), but without success...

It's annoying because it worked in previous versions. To avoid issues with GPU memory I ran the eval.py script with CUDA_VISIBLE_DEVICES=-1, which then ran it on the CPU independently of training. Also, changing NUM_EVAL_STEPS doesn't seem to have the expected effect of increasing or decreasing how often evaluation is run.
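A sketch of that CPU-only invocation, using the flags of the legacy eval.py script quoted later in this thread (paths are placeholders):

CUDA_VISIBLE_DEVICES=-1 python object_detection/legacy/eval.py \
--logtostderr \
--pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
--checkpoint_dir=${PATH_TO_TRAIN_DIR} \
--eval_dir=${PATH_TO_EVAL_DIR}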

This has been fixed and will go out in next release.

@pkulzc Thanks. When will the next release be pushed?

@pkulzc awesome, you're a legend!!! 👍

Thanks a lot. When will this change be pushed? I will need this function a lot. ^^

Thank you! I hope it will be released really soon; without it, evaluations are useless.

@pkulzc Can you please link to the commit that fixes this? It would be highly appreciated as I couldn't find it. Thanks!

@pkulzc it would be nice to have a workaround until the new release. Thanks.

I tried to find an easy workaround but I couldn't. Any idea when the update will be released? Thanks.

Running into the same issue... it worked fine a couple of months ago. @pkulzc any ETA or workaround?

@ernstgoyer @ldalzovo @aysark If you are only interested in displaying multiple test images with inferred bounding boxes (and don't need the side-by-side comparison with the ground truth) then you can still use the legacy eval method. I have tested this and it works.

python object_detection/legacy/eval.py --logtostderr \ 
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<output directory for eval files to be read by tensorboard>

@pkulzc This has been a while. When will the next release be out? I wonder if you could do a bug-fix release instead of a full release, if the latter is difficult.

@pkulzc Hope it will come soon.

Pull request is under review now.

@pkulzc Hi, any update on the PR ?

@harshini-gadige PR has already been merged into master and the issue is resolved

Hi, I still have the same issue using Google ML Engine, with runtime 1.10 or 1.9. I tried to use 1.11 and got the error: "INVALID_ARGUMENT: Field: runtime_version Error: The specified runtime version '1.11' with the Python version '' is not supported or is deprecated. Please specify a different runtime version. See https://cloud.google.com/ml/docs/concepts/runtime-version-list for a list of supported versions"

I have a stupid question: is the above problem only a display issue (i.e. TensorBoard only displays 1 evaluation image), or is it really a problem with the evaluation itself (i.e. instead of evaluating all the images in the evaluation folder, the program only evaluates 1 image)?
Thanks for your answer!

Hi, I am facing the same issue, i.e. instead of evaluating all the images in the evaluation folder, the program only evaluates 1 image. Has anybody fixed this?
Thanks

If you want to have more visualizations, try setting this field.

If you want to control the fraction of the data evaluated by the eval job, try setting this field.

Note that the first field lives in eval_config, while the second one lives in the input reader.

Fixed after updating the config files.
It should be the num_visualizations parameter in your eval_config; this parameter controls how many (randomly sampled) evaluation images appear in TensorBoard.
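For illustration, a minimal eval_config sketch with num_visualizations set (the values are illustrative, not recommendations). The sample_1_of_n_examples field shown for the input reader is my guess at the "fraction of data" knob mentioned two comments above and is not confirmed anywhere in this thread:

eval_config {
  num_examples: 2000
  num_visualizations: 20  # number of eval images exported to TensorBoard
  metrics_set: "coco_detection_metrics"
}
eval_input_reader {
  # label_map_path / tf_record_input_reader as in the config above
  sample_1_of_n_examples: 5  # assumed field name; sample every 5th eval example
}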

@ernstgoyer @ldalzovo @aysark If you are only interested in displaying multiple test images with inferred bounding boxes (and don't need the side-by-side comparison with the ground truth) then you can still use the legacy eval method. I have tested this and it works.

python object_detection/legacy/eval.py --logtostderr \ 
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<output directory for eval files to be read by tensorboard>

What should I write for eval_dir?
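For illustration only: per the description a few comments above, eval_dir is just the output directory for the eval event files, so any writable directory works, and it is the same directory you point TensorBoard's --logdir at. A hypothetical invocation (paths are placeholders):

python object_detection/legacy/eval.py --logtostderr \
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<any writable directory, e.g. ./eval_out>
tensorboard --logdir=<same directory, e.g. ./eval_out>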

FYI, this is not mentioned in the tutorial
(https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#evaluating-the-model-optional).
It would probably be a useful thing to have there, as it makes it easy to find out early on whether you have a problem.
