Models: Could only save 5 checkpoints w/ model_main.py

Created on 13 Aug 2018 · 21 Comments · Source: tensorflow/models

System information

What is the top-level directory of the model you are using:
research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Ubuntu 18.04
TensorFlow installed from (source or binary):
Source
TensorFlow version (use command below):
1.10
Bazel version (if compiling from source):
0.15.2
CUDA/cuDNN version:
9.2/7.1.4.18
GPU model and memory:
GTX 960/4G
Exact command to reproduce:
python tf_models/research/object_detection/model_main.py --alsologtostderr \
    --pipeline_config_path=ssdlite_mobilenet_v2_coco.config --model_dir=XXX

Describe the problem

It looks like the new model_main.py hardcodes many parameters. It simply ignores keep_checkpoint_every_n_hours. Separately, it exports only one image to TensorBoard, ignoring num_visualizations.

YijinLiu

👍8

All 21 comments

https://www.tensorflow.org/api_docs/python/tf/train/Saver#__init__
You can use tf.train.Saver(max_to_keep=None) to keep all checkpoints when you use legacy/trainer.py.

irmowan on 15 Aug 2018

👍3
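
For anyone wondering where the limit of five comes from: both tf.train.Saver (max_to_keep) and tf.estimator.RunConfig (keep_checkpoint_max) default to 5, and None or 0 turns deletion off. Below is a toy, TensorFlow-free sketch of that rotation rule. ToySaver and simulate are hypothetical names made up for illustration, and the checkpoint names and step interval are assumptions; this is not how TensorFlow implements it internally.

```python
class ToySaver:
    """Toy model of checkpoint rotation (illustration only, not TensorFlow code)."""

    def __init__(self, max_to_keep=5):
        # 5 mirrors the documented default of tf.train.Saver(max_to_keep=...).
        self.max_to_keep = max_to_keep
        self.kept = []

    def save(self, step):
        """Record a new checkpoint and drop the oldest ones beyond the limit."""
        self.kept.append(f"model.ckpt-{step}")
        # None or 0 means "never delete", matching the documented behaviour.
        if self.max_to_keep:
            while len(self.kept) > self.max_to_keep:
                self.kept.pop(0)
        return self.kept[-1]


def simulate(num_saves, max_to_keep=5, save_every=1000):
    """Return the checkpoints that survive after num_saves saves."""
    saver = ToySaver(max_to_keep=max_to_keep)
    for i in range(1, num_saves + 1):
        saver.save(i * save_every)
    return saver.kept


if __name__ == "__main__":
    print(simulate(8))                    # only the 5 most recent remain
    print(simulate(8, max_to_keep=None))  # all 8 remain
```

With the default, eight saves leave only the five most recent checkpoints, which matches the symptom in the issue title.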

@irmowan do you mean the one on line 408 in model_lib.py? Adding max_to_keep=None to this one has no effect.

fisheess on 11 Sep 2018

@irmowan check issue #5139. With tf r1.10 it can be set in model_main.py

fisheess on 11 Sep 2018

@YijinLiu @fisheess Did you fix this issue? I have the same problem.

lan2720 on 13 Sep 2018

@pkulzc Hi, how do I set max_to_keep=None? I have changed every tf.train.Saver call in the object_detection project to tf.train.Saver(max_to_keep=None), but it's still not working.

lan2720 on 13 Sep 2018

@lan2720 It is solved. With the latest TensorFlow you need to set the parameters in model_main.py. Check #5139. You need to add keep_checkpoint_max=None to line 59: config = tf.estimator.RunConfig(...)

fisheess on 13 Sep 2018

👍5 👀2

@lan2720 you can check the documentation for tf.estimator.RunConfig to get more information.
@YijinLiu I think it is time to close this issue since it is already solved.

fisheess on 13 Sep 2018

@fisheess Thank you! It works.

lan2720 on 13 Sep 2018

@fisheess's method works for me.

YijinLiu on 4 Oct 2018

Also check this: https://stackoverflow.com/a/51858873/8407621

saeedarabi92 on 25 Dec 2018

Sigh, it's broken again after I synced to the latest code...
It's caused by https://github.com/tensorflow/models/commit/a1337e01db4e3a54c48d2fbd6614c772bdf0f4c5#diff-76df8ca264a059e8a3003851fe4d7849R474
You'll need to add a max_to_keep parameter.

YijinLiu on 5 Jan 2019

With the latest code, I had to add max_to_keep to RunConfig in model_main.py as well as to tf.train.Saver in model_lib.py; then it worked.

NPetsky on 11 Jan 2019

@YijinLiu @NPetsky do you mean keep_checkpoint_max has no effect now? Do I then need to set max_to_keep=None in both files? I will try it out.

fisheess on 8 Feb 2019

@fisheess yes, adding max_to_keep only in model_main.py had no effect for me

NPetsky on 12 Feb 2019

👍5 😄2 ❤1 🎉1

@NPetsky max_to_keep is not a valid argument for RunConfig (https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig)
Adding keep_checkpoint_max to tf.estimator.RunConfig in model_main.py and max_to_keep to tf.train.Saver in model_lib.py works

kulkarnivishal on 8 May 2019

👍2 👀1
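
A possible explanation for why both settings are needed, offered as an assumption rather than a confirmed reading of the Estimator source: Estimator appears to build its own saver from RunConfig.keep_checkpoint_max only when the model_fn does not already supply one, and model_lib.py seems to supply its own tf.train.Saver (the saver calls quoted in this thread suggest so). A saver created without max_to_keep falls back to the default of 5, whatever RunConfig says. The function and dict below are hypothetical stand-ins that sketch that precedence in plain Python.

```python
SAVER_DEFAULT_MAX_TO_KEEP = 5  # documented default of tf.train.Saver


def effective_max_to_keep(run_config_keep_checkpoint_max, custom_saver_kwargs=None):
    """Sketch of which limit applies (assumed precedence, not TensorFlow code).

    custom_saver_kwargs stands in for the keyword arguments passed to a saver
    that the model code creates itself; None means no custom saver exists.
    """
    if custom_saver_kwargs is not None:
        # A custom saver wins; without max_to_keep it uses the Saver default.
        return custom_saver_kwargs.get("max_to_keep", SAVER_DEFAULT_MAX_TO_KEEP)
    # No custom saver: the RunConfig value is used.
    return run_config_keep_checkpoint_max


if __name__ == "__main__":
    print(effective_max_to_keep(500))                        # RunConfig only
    print(effective_max_to_keep(500, {}))                    # custom saver, no max_to_keep
    print(effective_max_to_keep(500, {"max_to_keep": 500}))  # both set
```

Under that assumption, setting only keep_checkpoint_max would leave the limit at 5 once a custom saver exists, which is consistent with what was reported above.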

@kulkarnivishal is right, please set keep_checkpoint_max in RunConfig.

pkulzc on 26 Jun 2019

Thank you @kulkarnivishal. It worked for me too.
I made the following two changes:

keep_checkpoint_max=500 in model_main.py:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, keep_checkpoint_max=500)

max_to_keep=500 in model_lib.py (added in two places):

saver = tf.train.Saver(
    variables_to_restore,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    max_to_keep=500)  # <= added max_to_keep argument here

saver = tf.train.Saver(
    sharded=True,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    save_relative_paths=True,
    max_to_keep=500)  # <= added max_to_keep argument here

prempatra on 9 Aug 2019

👍7 ❤2 🎉1

@prempatra This works perfectly, thanks.

surfii3z on 30 Oct 2019
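
One way to check that a change like this took effect is to count the checkpoints left in model_dir after a few save intervals. The helper below is a minimal sketch that assumes checkpoints are written as model.ckpt-<step>.index files (other setups may use a different prefix); checkpoint_steps is a hypothetical name, not part of the Object Detection API.

```python
import os
import re

# Assumed naming scheme: model.ckpt-<step>.index (one .index file per checkpoint).
_CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(model_dir):
    """Return the sorted global steps of the checkpoints found in model_dir."""
    steps = []
    for name in os.listdir(model_dir):
        match = _CKPT_INDEX.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


if __name__ == "__main__":
    import sys

    found = checkpoint_steps(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{len(found)} checkpoints on disk: {found}")
```

If the count stays at 5 after more than five saves, one of the two limits is probably still at its default.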

Thank you @kulkarnivishal. It worked for me too.
I made the following two changes:

keep_checkpoint_max=500 in model_main.py:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, keep_checkpoint_max=500)

max_to_keep=500 in model_lib.py (added in two places):

saver = tf.train.Saver(
    variables_to_restore,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    max_to_keep=500)  # <= added max_to_keep argument here

saver = tf.train.Saver(
    sharded=True,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    save_relative_paths=True,
    max_to_keep=500)  # <= added max_to_keep argument here

This didn't work for DeepLab. Does anyone know how to keep more checkpoints when training a DeepLab network rather than the object detection network?

zheyuanWang on 10 Dec 2019

It's in export_model.py.

Change the line to:
saver = tf.train.Saver(tf.all_variables(), max_to_keep=50)  # was: saver = tf.train.Saver(tf.all_variables())

in order to keep the latest 50 checkpoints.

zheyuanWang on 22 Jun 2020
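
A side note on where the keyword belongs, as a plain-Python sketch with stand-in values: max_to_keep is an argument of the tf.train.Saver constructor, so it has to sit next to the variable list rather than inside a method call on that list. list.append accepts no keyword arguments and always returns None. The names below are placeholders, not DeepLab's variables.

```python
def try_append_keyword(variables):
    """Return the name of the exception raised by list.append(max_to_keep=...)."""
    try:
        variables.append(max_to_keep=50)  # list.append takes no keyword arguments
    except TypeError as err:
        return type(err).__name__
    return None


placeholder_variables = ["weights", "biases"]
result = try_append_keyword(placeholder_variables)

# Even a valid append call returns None, so its result cannot serve as var_list.
appended = placeholder_variables.append("extra")

if __name__ == "__main__":
    print(result, appended)
```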

For the TensorFlow Object Detection API 2, in model_main_tf2.py (line 104), change the call to this:

model_lib_v2.train_loop(
    pipeline_config_path=FLAGS.pipeline_config_path,
    model_dir=FLAGS.model_dir,
    train_steps=FLAGS.num_train_steps,
    use_tpu=FLAGS.use_tpu,
    checkpoint_every_n=FLAGS.checkpoint_every_n,
    record_summaries=FLAGS.record_summaries,
    checkpoint_max_to_keep=500)

You need to add the last argument, checkpoint_max_to_keep, set to a number of your liking.