Models: [object_detection] Feature: Resume training from last checkpoint

Created on 28 Apr 2018 · 13 comments · Source: tensorflow/models

This is about the object_detection repo:

As far as I know, when training breaks because of any kind of error and you want to continue training from the last saved checkpoint, you need to manually point your model's config file at the checkpoint you want to resume from.

So now my question: is there a way to let the model find the last saved checkpoint and continue from it automatically? If not, I think it would be a nice feature to add to the train protos, and it should not be too hard to implement.

Maybe something like this (taken from matterport):

def find_last(self):
        """Finds the last checkpoint file of the last trained model in the
        model directory.
        Returns:
            log_dir: The directory where events and weights are saved
            checkpoint_path: the path to the last checkpoint file
        """
        # Get directory names. Each directory corresponds to a model
        dir_names = next(os.walk(self.model_dir))[1]
        key = self.config.NAME.lower()
        dir_names = filter(lambda f: f.startswith(key), dir_names)
        dir_names = sorted(dir_names)
        if not dir_names:
            return None, None
        # Pick last directory
        dir_name = os.path.join(self.model_dir, dir_names[-1])
        # Find the last checkpoint
        checkpoints = next(os.walk(dir_name))[2]
        checkpoints = filter(lambda f: f.startswith("mask_rcnn"), checkpoints)
        checkpoints = sorted(checkpoints)
        if not checkpoints:
            return dir_name, None
        checkpoint = os.path.join(dir_name, checkpoints[-1])
        return dir_name, checkpoint
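
In matterport's code this is then used roughly like the following (just a sketch, based on the version quoted above where find_last returns a (dir, checkpoint) tuple):

log_dir, checkpoint = model.find_last()
if checkpoint:
    model.load_weights(checkpoint, by_name=True)  # resume from the newest checkpoint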
Labels: awaiting response, feature

Most helpful comment

@GustavZ I believe restarting the train.py job with the same command line arguments should pick up the last saved checkpoint in the checkpoint directory. This is a feature built into Supervisor, which the TF Object Detection API uses.

Have you noticed a situation where a killed and restarted training job doesn't load the last checkpoint?

All 13 comments

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

CC @derekjchow for your thoughts on this feature request.

As a workaround I wrote a shell script that automatically updates the config with the last saved checkpoint in the addressed directory and restarts training if it breaks due to any error. I don't know if this is of interest for you...
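
Roughly, the same idea in Python looks like this (a sketch of the approach, not the actual script; TRAIN_DIR, CONFIG_PATH and the relaunch command are placeholders, and it assumes the TF 1.x object_detection train.py plus a pipeline config containing a fine_tune_checkpoint field):

import re
import subprocess

import tensorflow as tf

TRAIN_DIR = "training/"                    # directory where checkpoints are written
CONFIG_PATH = "training/pipeline.config"   # pipeline config passed to train.py

def update_config_with_latest_checkpoint(train_dir, config_path):
    """Rewrite fine_tune_checkpoint to point at the newest checkpoint, if any."""
    latest = tf.train.latest_checkpoint(train_dir)  # e.g. training/model.ckpt-12345
    if latest is None:
        return None
    with open(config_path) as f:
        config_text = f.read()
    config_text = re.sub(r'fine_tune_checkpoint: ".*?"',
                         'fine_tune_checkpoint: "%s"' % latest,
                         config_text)
    with open(config_path, "w") as f:
        f.write(config_text)
    return latest

while True:
    update_config_with_latest_checkpoint(TRAIN_DIR, CONFIG_PATH)
    # Relaunch training; if it crashes (e.g. OOM), the loop updates the config
    # with the newest checkpoint and starts it again in a fresh process.
    ret = subprocess.call(["python", "object_detection/train.py",
                           "--train_dir=" + TRAIN_DIR,
                           "--pipeline_config_path=" + CONFIG_PATH])
    if ret == 0:
        break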

@GustavZ, hi, many training jobs break because of "OOM". Can your script release the GPU memory first when it restarts training automatically?

@liangxiao05 yes it does, as it restarts all Python processes which allocate the GPU memory.

That's cool, and I think you don't need to write the checkpoint into the config file when training breaks; just restart 'python object_detection/train.py'. I support you opening this as a PR, it will be useful, thanks!

I don't think just restarting train.py is enough, as it always starts from the checkpoint provided in the config, and if that does not get updated it always restarts from the same point. So basically that's the whole point of the small script I wrote: reading the most recent checkpoint number and updating the config with it.

@GustavZ I believe restarting the train.py job with the same command line arguments should pick up the last saved checkpoint in the checkpoint directory. This is a feature built into Supervisor, which the TF Object Detection API uses.

Have you noticed a situation where a killed and restarted training job doesn't load the last checkpoint?
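
For reference, you can check which checkpoint a restarted job would pick up with something like this (a small sketch assuming TF 1.x; "training/" stands in for whatever --train_dir you pass to train.py):

import tensorflow as tf

# The Supervisor restores from the newest checkpoint recorded in the train_dir's
# "checkpoint" file; tf.train.latest_checkpoint reads the same file.
latest = tf.train.latest_checkpoint("training/")
print(latest)  # e.g. training/model.ckpt-12345, or None if nothing has been saved yet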

Training resumes from the latest checkpoint it has saved if 'from_detection_checkpoint' is set to True in the config file. You can see this being used when creating a model in lines 250-256 of trainer.py.

I had the same issue. You can just set NUM_TRAIN_STEPS to None and also point the fine_tune directory to the same directory you want to load. Then it should work.

@GustavZ Can you help me with restarting training from the last checkpoint?

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
