Ray: [rllib] What is the proper way to restore a checkpoint for fine-tuning / rendering / evaluation of a trained agent based on example/multiagent_cartpole.py?

Created on 5 Apr 2019  ·  9 comments  ·  Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): pip install ray
  • Ray version: 0.6.5
  • Python version: 3.6.2
  • Exact command to reproduce:

Describe the problem

Before my question, let me explain my understanding of the checkpoint file system. (You can skip this and jump to my question.)

The code in example/multiagent_cartpole.py produces an experiment_state-2019-04-03_00-47-28.json-like file and a directory PPO_experiment_name containing a few .pkl, .json, and .csv files.

The file system looks like:

- local_dir (say: "~/ray_results")
    - exp_name (say: "PPO")
        - checkpoints (say: experiment_state-2019-04-05_17-59-00.json)
        - directory (named like: PPO_cartpole_0_2019-04-05_18-28-0296h2tknq)
            - xxx.log
            - params.json
            - params.pkl (This is the file that stores the trained parameters, I guess?)
            - progress.csv
            - result.json

After one successful training run, we now have a trained agent (because I used one shared policy for all agents). We set local_dir exactly the same as in training, and set exp_name exactly as in training too, namely PPO.

Now here is my problem. The tune.run function takes two arguments which look helpful for restoring.

"resume" argument

The resume argument, once set to True, automatically searches local_dir/exp_name/ for the most recent experiment_state-<date_time>.json.

Resuming appears to work: after setting it to True, the restore seems to be successful, but the program immediately terminates, as if it inherits the terminated state from the checkpoint.

Here's the log:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/1 GPUs
Memory usage on this node: 4.3/16.7 GB
Result logdir: /home/SENSETIME/pengzhenghao/ray_results/PPO
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
 - PPO_tollgate_0:  TERMINATED, [12 CPUs, 1 GPUs], [pid=9214], 4846 s, 300 iter, 1320000 ts, 1.1e+03 rew

The printed reward is exactly what the trained agent is able to achieve, but I cannot continue training this agent, even if I set num_iters greater than the number of iterations in the last training run (namely 300).

What's more, it seems impossible to use the resume argument to specify a checkpoint by its exact filename.

In a nutshell, my question about the resume argument is:

  1. What is the purpose of this argument? It seems like it is only meant to restore an experiment after an unexpected failure, and therefore cannot be used to restore a specific checkpoint. Am I correct?

"restore" argument

After setting restore=<log_dir>, namely restore="./experiments", which is my log_dir, it turns out to raise an error:

Traceback (most recent call last):
  File "xxx/anaconda3/envs/dev/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 499, in restore
    ray.get(trial.runner.restore.remote(value))
  File "xxx/anaconda3/envs/dev/lib/python3.6/site-packages/ray/worker.py", line 2316, in get
    raise value
ray.exceptions.RayTaskError: ray_PPOAgent:restore() (pid=28099, host=g114e1900387)
  File "xxx/anaconda3/envs/dev/lib/python3.6/site-packages/ray/tune/trainable.py", line 304, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './experiments.tune_metadata'

I have checked everywhere on this computer and there is no file ending with .tune_metadata. I am really confused.

In short, what I am trying to do is:

  1. Restore the trained agent and continue its training with the same config.

  2. Restore the trained agent, retrieve the policy network, and use it in the same environment with rendering, in order to visualize its performance.

  3. Restore the trained agent as a pre-trained agent and modify the config, such as using more workers and GPUs, to train on a cluster.

Could you please tell me what I should do?

(By the way, the documentation is really insufficient for thoroughly understanding the whole RLlib workflow. Nevertheless, I still appreciate you guys for this excellent project, and I hope some day I can make some contributions too~)

Most helpful comment

Since I was searching for a simple way to load a trained agent and continue training with RLlib, and I only found this issue, here's what I found & what's the easiest way in my opinion:

ray.tune.run(PPOTrainer, config=myconfig, restore=path_to_trained_agent_checkpoint)

I.e., just set the path in the restore argument; that's it! No need for a custom train function.

All 9 comments

I think there is some confusion here about tune's checkpointing of experiment state, vs RLlib's checkpointing of trial state.

To enable RLlib checkpointing, you have to specify --checkpoint-freq. For example: rllib train --run=PG --checkpoint-freq=1 --env=CartPole-v0

Then, this will create checkpoints in ~/ray_results that include the .tune_metadata file. To restore, you can specify one of those paths, for example rllib train --run=PG --env=CartPole-v0 --restore=$HOME/ray_results/default/PG_CartPole-v0_0_2019-04-05_16-43-02s_gcpmkl/checkpoint_9/checkpoint-9.

This path can also be passed to agent.restore() in the Python API, which supports more advanced use cases like (2). For (1) and (3), I think the --restore flag for Tune may work.
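
As a minimal sketch of (2) via the Python API (assuming a Ray version where the PPO trainer class is exposed as PPOTrainer, a single-agent CartPole setup as in the CLI example above, and a placeholder checkpoint path):

import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer  # named PPOAgent in older Ray releases

ray.init()

# Rebuild the trainer with the same config used during training, then load the weights.
config = {"num_workers": 0}  # keep the model/env settings identical to training
agent = PPOTrainer(config=config, env="CartPole-v0")
agent.restore("/path/to/checkpoint_9/checkpoint-9")  # placeholder checkpoint path

# Roll out the restored policy with rendering to inspect its behavior.
env = gym.make("CartPole-v0")
obs = env.reset()
done = False
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    env.render()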

By the way, the document is really insufficient for thoroughly understanding the whole process of rllib.

Agree! This part happens to be documented half in RLlib and half in Tune. Some of it is here: https://ray.readthedocs.io/en/latest/rllib-training.html Any suggestions on how to improve this would be helpful.

Thanks for your reply! I find that if I run tune.run without any changes, the parameters of the trained agent are not saved... The .pkl file that is saved automatically simply records the trial, not anything related to the neural network.

Unfortunately, the training of the last few days has gone to waste. I suggest making checkpoint_at_end default to True...

The only blocker for enabling checkpointing by default is https://github.com/ray-project/ray/pull/4490

That will avoid out-of-disk-space errors for long training runs.

For the potential reader:

The resume argument does nothing but continue the last unfinished experiment. In this mode, it is not possible to change num_iters.

The restore argument takes the path of a checkpoint file as input. Concretely, the file looks like ~/ray_results/expname/envname_date_someothercodes/checkpoint_10/checkpoint-10. Note that checkpoint files only exist for tune.run() executions with checkpoint_at_end=True or checkpoint_freq set to a non-zero value.

Using the restore argument and pointing it at the checkpoint from which you want to continue is the only way to increase the number of iterations of a finished or unfinished experiment.
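
As a rough sketch (reusing the trial directory name from the file tree above; the checkpoint number, config, and stop criterion are placeholders to adapt to your own run):

import os
from ray import tune

# Placeholder config: in practice, reuse the exact config (multiagent policies,
# model settings, etc.) from the original training run.
config = {"env": "CartPole-v0", "num_workers": 2}

# Hypothetical checkpoint produced by a run with checkpoint_freq / checkpoint_at_end.
checkpoint = os.path.expanduser(
    "~/ray_results/PPO/PPO_cartpole_0_2019-04-05_18-28-0296h2tknq/"
    "checkpoint_300/checkpoint-300"
)

tune.run(
    "PPO",
    config=config,
    checkpoint_freq=10,
    checkpoint_at_end=True,
    restore=checkpoint,
    stop={"training_iteration": 600},  # larger than the 300 iterations already trained
)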

Thank Eric for offering quick and kind responses!

For me, restore does not seem to work no matter what I try.

tune.run(
    "PPO",
    # name="PPO_discrete5",
    local_dir="/content/drive/My Drive/Colab Notebooks/rltrader/Experiments",
    checkpoint_freq=10,  # iterations
    checkpoint_at_end=True,
    max_failures=100,
    # resume=True,
    restore='content/drive/My Drive/Colab Notebooks/rltrader/Experiments/PPO_discrete3/PPO_ContTradingEnv_0_2019-05-03_04-51-02zykryvgl/checkpoint_218/checkpoint-218',
    # search_alg=algo,
    # scheduler=ahb,
    # 2 if testing, 50 or more for real
    # num_samples=50,
    stop={
        # "episode_reward_mean": 0,
        # "training_iteration": 1,
        # "timesteps_total": 1000,
        "episodes_total": 1000,
    },
)

What combination of the above do I need to restore a checkpoint using tune.run, or is restore not working? I have run 1000 episodes, and wish to run 1000 more.

@evanatyourservice Are you using the latest Ray? Ray failed to restore due to a bug; see https://github.com/ray-project/ray/pull/4733

@evanatyourservice Please re-run your code using the latest Ray and see if everything works well.

My view of the resume/restore:

  • you start a training run with a lot of grid search for a number of iterations, let's say over some gamma options
  • you add checkpoint_freq=10, checkpoint_at_end=True to ray.tune
  • you give it a unique name with name=name and you specify local_dir as some folder where you have space. The name parameter is the folder of this experiment (important)

If the run somehow stops, you add resume=True. If any of the trials gave an error, resume won't restart them.

Here comes the nice part:

  • if you get an errored trial, let's say due to memory, you add the restore parameter. How to do that: run
    find "[local_dir]/[name]" -iname checkpoint-[K]
    where K is the last checkpoint created, or the last iteration, i.e. the checkpoint from which you want to retry something different
  • when you do that, you should disable all grid search and specify all parameters manually, since you want to resume them (see the sketch after this list). If you leave grid search on, it will restart the grid search from there with the first set of parameters and run multiple trials from that point. So I recommend you specify gamma, lr (or lr_schedule), and the number of iterations manually. This is the good part, as it allows you to resume any past checkpoint and run for a larger number of iterations or try other things (gamma/lr can be changed at a checkpoint)
  • one other idea: by keeping the same name and same local_dir, you can start more grid search variations in the same folder. Later, you can analyse and plot the whole thing, so try different parameters, go for longer learning iterations, and iterate until the analysis shows what you want it to show.
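
As a sketch of the "disable grid search when restoring" point above (env, paths, checkpoint number, and hyperparameter values are all placeholders):

from ray import tune

# Original run: grid search over gamma (placeholder values), e.g.
# config = {"env": "CartPole-v0", "gamma": tune.grid_search([0.95, 0.99])}

# When restoring an errored trial, pin the searched values to the ones that trial
# actually used, so Tune does not restart the whole grid from the checkpoint.
pinned_config = {"env": "CartPole-v0", "gamma": 0.99, "lr": 5e-5}

tune.run(
    "PPO",
    name="gamma_search",            # same name and local_dir as the original run
    local_dir="/data/ray_results",
    config=pinned_config,
    checkpoint_freq=10,
    checkpoint_at_end=True,
    # Path found with: find "/data/ray_results/gamma_search" -iname checkpoint-150
    restore="/data/ray_results/gamma_search/PPO_CartPole-v0_1_gamma=0.99/checkpoint_150/checkpoint-150",
    stop={"training_iteration": 300},  # go beyond the iteration count already reached
)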

Other ideas:
LR:

  • try a constant lr or an lr_schedule, like:

    'lr_schedule': [[0 * lr_batch_size, 5e-5],
                    [75 * lr_batch_size, 5e-5],
                    [110 * lr_batch_size, 1e-5],
                    [110 * lr_batch_size, 1e-5],
                    [120 * lr_batch_size, 5e-6],
                    [140 * lr_batch_size, 5e-7],
                    [200 * lr_batch_size, 1e-10],
                    [300 * lr_batch_size, 1e-12],
                    ]
    where lr_batch_size is your number of timesteps per iteration (see the config sketch after this list).
  • the learning rate can be resumed at any time from a checkpoint
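
A minimal sketch of where such a schedule plugs in (all values and the env are placeholders; lr_schedule is a list of [timestep, learning_rate] breakpoints inside the trainer config):

# Placeholder: timesteps collected per training iteration (e.g. train_batch_size).
lr_batch_size = 4000

config = {
    "env": "CartPole-v0",              # placeholder env
    "train_batch_size": lr_batch_size,
    "lr_schedule": [
        [0 * lr_batch_size, 5e-5],     # starting learning rate
        [140 * lr_batch_size, 5e-7],   # annealed over ~140 iterations
    ],
}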

Chart:

  • plot episode_reward_mean by training_iteration, and also by training_iteration you can choose to plot:

    • episode_reward_min, episode_reward_max

    • cur_lr, look for the column in the dataframe ending in "/cur_lr"

  • if there is interest, I can post my code on charting, based on Plotly
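
For reference, a minimal sketch of such a chart (assuming the default progress.csv columns, Plotly Express, and a placeholder path to a trial directory):

import pandas as pd
import plotly.express as px

# Placeholder path: point this at the progress.csv inside your trial directory.
df = pd.read_csv("/data/ray_results/my_experiment/PPO_CartPole-v0_0/progress.csv")

# Reward curve over training iterations.
px.line(df, x="training_iteration", y="episode_reward_mean",
        title="episode_reward_mean per training_iteration").show()

# Min / mean / max rewards on one chart.
px.line(df, x="training_iteration",
        y=["episode_reward_min", "episode_reward_mean", "episode_reward_max"]).show()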

Since I was searching for a simple way to load a trained agent and continue training with RLlib, and I only found this issue, here's what I found & what's the easiest way in my opinion:

ray.tune.run(PPOTrainer, config=myconfig, restore=path_to_trained_agent_checkpoint)

I.e., just set the path in the restore argument; that's it! No need for a custom train function.
