Looks like the new model_main.py hardcodes many parameters. It simply ignores keep_checkpoint_every_n_hours. In another bug, it only exports one image to TensorBoard, ignoring num_visualizations.
https://www.tensorflow.org/api_docs/python/tf/train/Saver#__init__
You can use tf.train.Saver(max_to_keep=None) to save all checkpoints when you use legacy/trainer.py.
@irmowan do you mean the one on line 408 in model_lib.py? Adding max_to_keep=None to this one has no effect.
@irmowan check issue #5139. With tf r1.10 it can be set in model_main.py
@YijinLiu @fisheess Did you fix this issue? I have the same problem.
@pkulzc Hi, how do I set max_to_keep=None? I have changed all tf.train.Saver calls in the object_detection project to use max_to_keep=None, but it's still not working.
@lan2720 It is solved. With the latest TensorFlow you need to set the parameter in model_main.py. Check #5139. You need to add keep_checkpoint_max=None to line 59: config = tf.estimator.RunConfig(...)
@lan2720 You can check the documentation on tf.estimator.RunConfig for more information.
@YijinLiu I think it is time to close this issue since it is already solved.
@fisheess Thank you! It works.
@fisheess's method works for me.
Also check this: https://stackoverflow.com/a/51858873/8407621
Sigh, it's broken again after I synced to the latest code...
It's caused by https://github.com/tensorflow/models/commit/a1337e01db4e3a54c48d2fbd6614c772bdf0f4c5#diff-76df8ca264a059e8a3003851fe4d7849R474
You'll need to add a max_to_keep parameter.
With the latest code, I had to add max_to_keep to RunConfig in model_main.py as well as to tf.train.Saver in model_lib.py; then it worked.
@YijinLiu @NPetsky Do you mean keep_checkpoint_max has no effect now? Then I need to set max_to_keep=None in both files? I will try it out.
@fisheess yes, adding max_to_keep only in model_main.py had no effect for me
@NPetsky max_to_keep is not a valid argument for RunConfig (https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig)
Adding keep_checkpoint_max to tf.estimator.RunConfig in model_main.py and max_to_keep to tf.train.Saver in model_lib.py works
@kulkarnivishal is right, please set keep_checkpoint_max in RunConfig.
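For anyone unsure what the two parameters do: per the TensorFlow docs, both max_to_keep on tf.train.Saver and keep_checkpoint_max on tf.estimator.RunConfig default to 5 and treat None or 0 as "never delete old checkpoints". The pure-Python sketch below only mimics that retention rule so the effect is easy to see; retained_checkpoints is a made-up helper for illustration, not part of TensorFlow.

```python
from typing import List, Optional


def retained_checkpoints(steps: List[int], max_to_keep: Optional[int] = 5) -> List[int]:
    """Return the checkpoint steps that survive pruning.

    Hypothetical helper that imitates the documented rule: keep the
    `max_to_keep` most recent checkpoints, or all of them when the
    value is None or 0.
    """
    ordered = sorted(steps)
    if not max_to_keep:  # None or 0 disables pruning
        return ordered
    return ordered[-max_to_keep:]


saved = list(range(1000, 11000, 1000))  # checkpoints at steps 1000..10000
print(retained_checkpoints(saved))                    # default: only the last 5
print(retained_checkpoints(saved, max_to_keep=None))  # nothing is deleted
```

This is why leaving either setting at its default silently caps you at five checkpoints.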
Thank you @kulkarnivishal It worked for me too.
I made the following two changes:
- keep_checkpoint_max=500 in model_main.py
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, keep_checkpoint_max=500)
- max_to_keep=500 in model_lib.py (added in two places)
saver = tf.train.Saver(
    variables_to_restore,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    max_to_keep=500)  # <= added max_to_keep argument here
saver = tf.train.Saver(
    sharded=True,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    save_relative_paths=True,
    max_to_keep=500)  # <= added max_to_keep argument here
@prempatra This works perfectly, thanks.
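If you want to confirm the change actually took effect, one option is to count the checkpoints left in model_dir after a run. The sketch below assumes each checkpoint leaves exactly one file ending in "ckpt-<number>.index" (for example model.ckpt-2000.index); checkpoint_steps is a hypothetical helper written for this thread, not something shipped with the Object Detection API.

```python
import re
from pathlib import Path
from typing import List

# Assumption: every checkpoint writes one "<prefix>ckpt-<number>.index" file,
# e.g. model.ckpt-2000.index. Adjust the pattern if your files differ.
_INDEX_RE = re.compile(r"ckpt-(\d+)\.index$")


def checkpoint_steps(model_dir: str) -> List[int]:
    """Return the sorted checkpoint numbers found directly inside model_dir."""
    steps = []
    for path in Path(model_dir).iterdir():
        match = _INDEX_RE.search(path.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

Calling len(checkpoint_steps(model_dir)) on your own model_dir after training past five checkpoint saves should give more than 5 if the limit was really lifted.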
It didn't work for deeplab. Does anyone know how to save more checkpoints while training the deeplab network instead of the object-detection network?
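It's in export_model.py. Change the line to:
saver = tf.train.Saver(tf.all_variables(), max_to_keep=50)
# saver = tf.train.Saver(tf.all_variables())
in order to keep the latest 50 checkpoints. Note that max_to_keep is a keyword argument of tf.train.Saver itself, not something to append to the variable list.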
For the TensorFlow Object Detection API 2, in model_main_tf2.py, line 104, change the call to this:
model_lib_v2.train_loop(
    pipeline_config_path=FLAGS.pipeline_config_path,
    model_dir=FLAGS.model_dir,
    train_steps=FLAGS.num_train_steps,
    use_tpu=FLAGS.use_tpu,
    checkpoint_every_n=FLAGS.checkpoint_every_n,
    record_summaries=FLAGS.record_summaries,
    checkpoint_max_to_keep=500)
You need to add the last argument, checkpoint_max_to_keep, set to whatever number you like.
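If you would rather not hardcode the number, one possibility is to read it from the command line and pass it through. The sketch below is only an illustration: the --checkpoint_max_to_keep flag and build_train_loop_kwargs are hypothetical names, and it uses argparse to stay self-contained, whereas model_main_tf2.py reads its settings from FLAGS.

```python
import argparse
from typing import Dict, List, Optional


def build_train_loop_kwargs(argv: Optional[List[str]] = None) -> Dict[str, int]:
    """Parse a hypothetical --checkpoint_max_to_keep flag into keyword arguments."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag name, chosen to match the train_loop argument above.
    parser.add_argument("--checkpoint_max_to_keep", type=int, default=500,
                        help="Number of recent checkpoints to keep.")
    args = parser.parse_args(argv)
    return {"checkpoint_max_to_keep": args.checkpoint_max_to_keep}


# Example: simulate passing the flag on the command line.
print(build_train_loop_kwargs(["--checkpoint_max_to_keep", "50"]))
```

The resulting dict could then be splatted into the call shown above with **build_train_loop_kwargs() in place of the literal checkpoint_max_to_keep=500.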