Models: Could only save 5 checkpoints w/ model_main.py

Created on 13 Aug 2018  ·  21 Comments  ·  Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using:
    research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 18.04
  • TensorFlow installed from (source or binary):
    Source
  • TensorFlow version (use command below):
    1.10
  • Bazel version (if compiling from source):
    0.15.2
  • CUDA/cuDNN version:
    9.2/7.1.4.18
  • GPU model and memory:
    GTX 960/4G
  • Exact command to reproduce:
    python tf_models/research/object_detection/model_main.py --alsologtostderr
    --pipeline_config_path=ssdlite_mobilenet_v2_coco.config --model_dir=XXX

Describe the problem

It looks like the new model_main.py hardcodes many parameters. It simply ignores keep_checkpoint_every_n_hours. In a separate bug, it exports only one image to TensorBoard, ignoring num_visualizations.

All 21 comments

https://www.tensorflow.org/api_docs/python/tf/train/Saver#__init__
You can use tf.train.Saver(max_to_keep=None) to keep all checkpoints when you use legacy/trainer.py.
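To make the effect of max_to_keep concrete, here is a toy simulation in plain Python. It is not TensorFlow's implementation: simulate_retention is a hypothetical helper, and it ignores keep_checkpoint_every_n_hours entirely. It only models the documented rule that tf.train.Saver keeps the 5 most recent checkpoints by default and deletes nothing when max_to_keep is None or 0.

```python
from collections import deque

DEFAULT_MAX_TO_KEEP = 5  # default of tf.train.Saver(max_to_keep=...) in TF 1.x


def simulate_retention(save_steps, max_to_keep=DEFAULT_MAX_TO_KEEP):
    """Return the checkpoint steps still on disk after saving at each step.

    Toy model only: None or 0 means "never delete"; otherwise the oldest
    checkpoint is dropped once more than max_to_keep exist.
    """
    if not max_to_keep:  # None or 0 keeps everything
        return list(save_steps)
    kept = deque(maxlen=max_to_keep)  # a bounded deque discards its oldest entry
    for step in save_steps:
        kept.append(step)
    return list(kept)


steps = range(1000, 9000, 1000)  # eight saves: 1000, 2000, ..., 8000
print(simulate_retention(steps))                    # default limit of 5
print(simulate_retention(steps, max_to_keep=None))  # keep every checkpoint
```

With the default limit only the five newest steps survive, which matches the symptom in the issue title.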

@irmowan do you mean the one on line 408 in model_lib.py? Adding max_to_keep=None to this one has no effect.

@irmowan check issue #5139. With tf r1.10 it can be set in model_main.py

@YijinLiu @fisheess Did you fix this issue? I have the same problem.

@pkulzc Hi, how do I set max_to_keep=None? I have changed every tf.train.Saver call in the object_detection project to tf.train.Saver(max_to_keep=None), but it's still not working.

@lan2720 It is solved. With the latest TensorFlow you need to set the parameters in model_main.py. Check #5139. You need to add keep_checkpoint_max=None to the config = tf.estimator.RunConfig(...) call on line 59.

@lan2720 You can check the documentation for tf.estimator.RunConfig to get more information.
@YijinLiu I think it is time to close this issue, since it is already solved.

@fisheess Thank you! It works.

@fisheess 's method works for me.

Sigh, it's broken again after I synced to the latest code...
It's caused by https://github.com/tensorflow/models/commit/a1337e01db4e3a54c48d2fbd6614c772bdf0f4c5#diff-76df8ca264a059e8a3003851fe4d7849R474
You'll need to add a max_to_keep parameter.

With the latest code, I had to add max_to_keep to RunConfig in model_main.py as well as to tf.train.Saver in model_lib.py; then it worked.

@YijinLiu @NPetsky Do you mean keep_checkpoint_max has no effect now? Then I need to set max_to_keep=None in both files? I will try it out.

@fisheess yes, adding max_to_keep only in model_main.py had no effect for me

@NPetsky max_to_keep is not a valid argument for RunConfig (https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig)
Adding keep_checkpoint_max to tf.estimator.RunConfig in model_main.py and max_to_keep to tf.train.Saver in model_lib.py works
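Because the two APIs use different parameter names for the same limit, it is easy to set one and forget the other. The sketch below shows one way to keep them in sync. retention_kwargs is a hypothetical helper, not part of TensorFlow or the Object Detection API; only the keyword names keep_checkpoint_max (tf.estimator.RunConfig) and max_to_keep (tf.train.Saver) come from the thread.

```python
def retention_kwargs(max_checkpoints=None):
    """Build matching keyword arguments for RunConfig and Saver.

    max_checkpoints=None (or 0) asks both APIs to keep every checkpoint.
    """
    if max_checkpoints is not None and max_checkpoints < 0:
        raise ValueError("max_checkpoints must be None or a non-negative integer")
    return {
        "run_config": {"keep_checkpoint_max": max_checkpoints},  # tf.estimator.RunConfig
        "saver": {"max_to_keep": max_checkpoints},               # tf.train.Saver
    }


kwargs = retention_kwargs(500)
print(kwargs["run_config"])  # use as tf.estimator.RunConfig(model_dir=..., **kwargs["run_config"])
print(kwargs["saver"])       # use as tf.train.Saver(..., **kwargs["saver"])
```

Unpacking the dictionaries at both call sites means a later change to the limit only has to be made in one place.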

@kulkarnivishal is right, please set keep_checkpoint_max in RunConfig.

Thank you @kulkarnivishal, it worked for me too.
I made the following two changes:

  1. keep_checkpoint_max=500 in model_main.py
    config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, keep_checkpoint_max=500)
  2. max_to_keep=500 in model_lib.py (added in two places)
    saver = tf.train.Saver(
        variables_to_restore,
        keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
        max_to_keep=500)  # <= added max_to_keep argument here

    saver = tf.train.Saver(
        sharded=True,
        keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
        save_relative_paths=True,
        max_to_keep=500)  # <= added max_to_keep argument here

@prempatra This works perfectly, thanks.

The fix above (keep_checkpoint_max=500 in model_main.py plus max_to_keep=500 in both tf.train.Saver calls in model_lib.py) didn't work for DeepLab. Does anyone know how to save more checkpoints while training a DeepLab network rather than an object-detection network?

It's in export_model.py. Change the line

    saver = tf.train.Saver(tf.all_variables())

to:

    saver = tf.train.Saver(tf.all_variables(), max_to_keep=50)

in order to keep the latest 50 checkpoints.

For the TensorFlow Object Detection API 2, change the call at line 104 of model_main_tf2.py to this:

    model_lib_v2.train_loop(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir,
        train_steps=FLAGS.num_train_steps,
        use_tpu=FLAGS.use_tpu,
        checkpoint_every_n=FLAGS.checkpoint_every_n,
        record_summaries=FLAGS.record_summaries,
        checkpoint_max_to_keep=500)

You need to add the last argument, checkpoint_max_to_keep, set to whatever number you like.
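After any of the changes in this thread, it helps to confirm how many checkpoints actually remain in model_dir. The following is a minimal sketch, assuming the usual file names: model.ckpt-<N>.index for Estimator-based training and ckpt-<N>.index for the TF2 training loop. retained_checkpoints is a hypothetical helper, and the demo runs against empty placeholder files rather than real checkpoints.

```python
import re
import tempfile
from pathlib import Path

# Matches "model.ckpt-1234.index" (Estimator / TF1) and "ckpt-12.index" (TF2).
_INDEX_RE = re.compile(r"^(?:model\.)?ckpt-(\d+)\.index$")


def retained_checkpoints(model_dir):
    """Return the sorted checkpoint numbers that have an .index file in model_dir."""
    numbers = []
    for path in Path(model_dir).iterdir():
        match = _INDEX_RE.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


if __name__ == "__main__":
    # Demo on a throwaway directory filled with fake, empty checkpoint files.
    with tempfile.TemporaryDirectory() as tmp:
        for n in (1, 2, 10):
            (Path(tmp) / f"ckpt-{n}.index").touch()
            (Path(tmp) / f"ckpt-{n}.data-00000-of-00001").touch()
        (Path(tmp) / "checkpoint").touch()
        print(retained_checkpoints(tmp))
```

If the list never grows beyond five entries during a long run, the retention limit is probably still at its default somewhere.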
