Ray: [rllib] Best workflow to train, save, and test agent

Created on 24 Jun 2020 · 5 comments · Source: ray-project/ray

What is your question?

This is a great framework, but after reading the documentation and playing around for weeks, I'm still struggling to get the simple workflow working: train a PPO agent, save a checkpoint at the end, save the stats, and then use the trained agent for evaluation or visualization.

It starts with my confusion about the two ways of training an RL agent.
Either

from ray.rllib.agents.ppo import PPOTrainer

trainer = PPOTrainer(env="CartPole-v0", config={"train_batch_size": 4000})
while True:
    print(trainer.train())

This makes saving my agent simple with trainer.save(path), and I can use the trained agent afterwards for testing with trainer.compute_action(observation). But as far as I know, I cannot change the log directory, which always defaults to ~/ray_results.
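
For reference, here is a minimal end-to-end sketch of this first approach (the checkpoint directory and the number of training iterations below are arbitrary choices, not from the original post):

import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# Train with the plain Trainer API.
trainer = PPOTrainer(env="CartPole-v0", config={"train_batch_size": 4000})
for _ in range(10):
    print(trainer.train())

# save() returns the path of the created checkpoint.
checkpoint_path = trainer.save("./ppo_cartpole")

# Restore the checkpoint and run a single evaluation episode.
trainer.restore(checkpoint_path)
env = gym.make("CartPole-v0")
obs, done, episode_reward = env.reset(), False, 0.0
while not done:
    action = trainer.compute_action(obs)
    obs, reward, done, info = env.step(action)
    episode_reward += reward
print("episode reward:", episode_reward)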

Or I use ray.tune.run():

from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

tune.run(PPOTrainer, config={"env": "CartPole-v0", "train_batch_size": 4000},
         local_dir=my_path, checkpoint_at_end=True)

This allows me to configure a custom local_dir for my logs and to create a checkpoint at the end. But as far as I know, I don't get access to my trained agent: ray.tune.run() just returns an ExperimentAnalysis object, not my trained agent or the exact path of the checkpoints (the trial directory name includes a random hash). The experiment_id in the results does not correspond to the hash used in the directory name, so I cannot reconstruct the path.

My only resort at the moment is to split the workflow into two separate steps, training with ray.tune.run and then loading and testing the agent, where I have to find and copy & paste the path of the last checkpoint manually in between. Very inconvenient.

There must be a more convenient way to do what I want, right?

Ray version and other system information (Python version, TensorFlow version, OS):

  • Ray 0.8.5
  • TensorFlow 2.2.0
  • Python 3.8.3
  • OS: Ubuntu 20.04 on WSL (Win 10)


All 5 comments

Hi @stefanbschneider! I'm having the same problems understanding this... Can you explain how your solution works, please?

I finally got a workflow that does everything I want; train with configurable log dir, return the saved agent path, load the trained agent, and use it for testing.

Here's the basic code (within a custom class):

def train(self, stop_criteria):
    """
    Train an RLlib PPO agent using tune until any of the configured stopping criteria is met.
    :param stop_criteria: Dict with stopping criteria.
        See https://docs.ray.io/en/latest/tune/api_docs/execution.html#tune-run
    :return: Return the path to the saved agent (checkpoint) and tune's ExperimentAnalysis object
        See https://docs.ray.io/en/latest/tune/api_docs/analysis.html#experimentanalysis-tune-experimentanalysis
    """
    analysis = ray.tune.run(ppo.PPOTrainer, config=self.config, local_dir=self.save_dir, stop=stop_criteria,
                            checkpoint_at_end=True)
    # one (checkpoint path, metric value) pair per checkpoint of the best trial
    checkpoints = analysis.get_trial_checkpoints_paths(trial=analysis.get_best_trial('episode_reward_mean'),
                                                       metric='episode_reward_mean')
    # retrieve the checkpoint path; we only have a single checkpoint, so take the first one
    checkpoint_path = checkpoints[0][0]
    return checkpoint_path, analysis

def load(self, path):
    """
    Load a trained RLlib agent from the specified path. Call this before testing a trained agent.
    :param path: Path pointing to the agent's saved checkpoint (only used for RLlib agents)
    """
    self.agent = ppo.PPOTrainer(config=self.config, env=self.env_class)
    self.agent.restore(path)

def test(self):
    """Test trained agent for a single episode. Return the episode reward"""
    # instantiate env class
    env = self.env_class(self.env_config)

    # run until episode ends
    episode_reward = 0
    done = False
    obs = env.reset()
    while not done:
        action = self.agent.compute_action(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward

    return episode_reward

With that you can just call train, load, test and it should work. I hope this helps.
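
For instance, a driver script around these three methods might look like this (the wrapper class name and its constructor arguments are hypothetical; only train, load, and test come from the snippet above):

# RLlibPPOAgent is a hypothetical wrapper class holding self.config, self.save_dir,
# self.env_class, and self.env_config, plus the train/load/test methods above.
agent = RLlibPPOAgent(config=config, env_class=MyEnv, env_config=env_config,
                      save_dir="./results")
checkpoint_path, analysis = agent.train(stop_criteria={"training_iteration": 50})
agent.load(checkpoint_path)
print("episode reward:", agent.test())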

Not sure if there's any other/better way to do it. But it solves my issue.

Just for the record, I tried your code on Impala and it works fine, except when use_lstm is True. In that case, the test function fails because no initial states are passed for the RNN.

ValueError: Must pass in RNN state batches for placeholders [<tf.Tensor 'default_policy/Placeholder:0' shape=(?, 16) dtype=float32>, <tf.Tensor 'default_policy/Placeholder_1:0' shape=(?, 16) dtype=float32>], got []

I tried to obtain the initial states from LSTMWrapper.get_initial_state(), but couldn't find any parameter in the model or in the main config to pass those in.

So maybe there is another way around it.
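
One possible workaround (an unverified sketch): compute_action accepts an optional state argument, and when a state is passed it also returns the next RNN state, so the test loop can carry the state explicitly and take the initial state from the policy:

# Sketch of an LSTM-aware test loop, reusing self.agent and env from test() above.
state = self.agent.get_policy().get_initial_state()
obs = env.reset()
done = False
episode_reward = 0
while not done:
    # With a state argument, compute_action returns (action, state_out, extra_fetches).
    action, state, _ = self.agent.compute_action(obs, state=state)
    obs, reward, done, info = env.step(action)
    episode_reward += reward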

I expected to be able to set the workers and the learner to zero (in tune.run(...)) and only have evaluation workers for testing, but TensorBoard shows different scores for several evaluation episodes, which shouldn't be the case for a discrete env and no learning.

Found another way, maybe the "expected" one:

I needed to set "in_evaluation": True to put the trainer in "evaluation mode", as stated here:
https://docs.ray.io/en/master/rllib-training.html#common-parameters

I also added "explore": False in the evaluation config (not sure if it's necessary).

config = {"env": customEnv,
        "env_config": {'mode': "train"},
        "num_workers": 0,
        "num_gpus": 0,
        "in_evaluation": True,
        "evaluation_num_workers": 1,
        # Custom eval function
        "custom_eval_function": custom_eval_function,
        # Enable evaluation, once per 10 training iteration.
        "evaluation_interval": 10,
        # Run 1 episode each time evaluation runs.
        "evaluation_num_episodes": 1,
        # Override the env config for evaluation.
        "evaluation_config": {
            "explore": False,
            "env_config": {
                # Use test set to evaluate
                'mode': "test"}
        }
        ...
}
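
For reference, the custom_eval_function referenced in that config follows the signature used in RLlib's custom_eval.py example (this is only a sketch; the single sampling round and the timeout value are arbitrary):

import ray
from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes

def custom_eval_function(trainer, eval_workers):
    # Run one round of rollouts on the remote evaluation workers.
    ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
    # Collect the sampled episodes and summarize them into a metrics dict,
    # which shows up under evaluation/... in the results.
    episodes, _ = collect_episodes(
        remote_workers=eval_workers.remote_workers(), timeout_seconds=99999)
    return summarize_episodes(episodes)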

Edit: To see all values returned at the end of the experiment:

results._retrieve_rows(metric='evaluation/episode_reward_mean')

https://github.com/ray-project/ray/blob/master/python/ray/tune/analysis/experiment_analysis.py#L154

Edit 2: After trying for a while, I'm not sure that "in_evaluation": True has that purpose. I'll stick to Stefan's solution and find a workaround when using an RNN model.

I know you closed this issue, but having this simple workflow in the official documentation would be a huge boon.
