_Problem description:_
Suppose we can instantiate several environment simulators with predefined dynamics (source, or train tasks) and an instance of an environment with slightly modified dynamics (target, or test task).
The aim is to run policy optimization on episodes from the source domain and periodically check the agent's performance on samples from the target environment, e.g.:
repeat:
1. run policy optimization for ~10 episodes from the source environment instances;
2. run ~1 episode of policy evaluation on the target environment;
It is desirable to run the evaluation task as a separate worker to prevent knowledge leakage (as opposed to a 'just-set-trainer-learning-rate-to-zero' approach).
It is also highly desirable to run the whole experiment from the Tune Python API and log evaluation results to the TensorBoard summaries under an 'evaluate' tag.
_Question_:
Is there any predefined solution for setting up such a workflow? If not, is there a suggested way to implement it?
My search through the docs only turned up the checkpoint/load/evaluate routine of the command-line API:
https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies
[Possibly] related: #2799, #4569 and #4496
The closest thing might be the "evaluation_interval" setting in DQN, which will periodically run episodes with epsilon=0 and log them under a separate evaluation/ metric key.
This is something that could potentially be generalized to other algorithms as well. Do you see some easy way of achieving this? Perhaps we can allow a separate evaluation config to be specified that allows various config settings, including the env, to be overridden?
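For reference, a minimal sketch of what using that existing DQN option looks like (hedged: the config key is the one discussed above, and CartPole-v0 is just a stand-in env):

```python
import ray
from ray import tune

ray.init()
tune.run(
    "DQN",
    stop={"training_iteration": 50},
    config={
        "env": "CartPole-v0",
        # DQN-specific at the time: run greedy (epsilon=0) evaluation episodes
        # every N calls to train() and report them under the evaluation/ metrics.
        "evaluation_interval": 10,
    },
)
```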
"evaluation_interval" setting in DQN, which will periodically run episodes with epsilon=0 and log them under a separate evaluation/ metric key.
Yes indeed, https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L51 is exactly what I meant.
> Perhaps we can allow a separate evaluation config
Yes; if I understand the code at https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L211 correctly, then in addition to the policy_graph extra_config, one could specify an evaluation env_extra_config that [partially] overrides env_config and is passed along to env_creator when making the local evaluator.
It seems (at first glance) that the only change needed to the evaluation policy's behaviour is to disable exploration. In the case of actor-critic architectures, this could be achieved by computing actions deterministically (#4496).
> Yes; if I understand the code at https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L211 correctly, then in addition to the policy_graph extra_config, one could specify an evaluation env_extra_config that [partially] overrides env_config and is passed along to env_creator when making the local evaluator.
Makes sense. I guess this limits you to the same env class, but the env config could be different for evaluation.
> It seems (at first glance) that the only change needed to the evaluation policy's behaviour is to disable exploration. In the case of actor-critic architectures, this could be achieved by computing actions deterministically (#4496).
Yep, we could add a model option to set DiagGaussian standard deviation to zero, and also softmax temperature to close to zero. Then, this could be handled via extra_config as well.
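As a quick illustration of why those two knobs give deterministic behaviour (plain numpy, not RLlib code): driving the softmax temperature toward zero collapses the categorical distribution onto the argmax, and zeroing the std collapses the Gaussian onto its mean.

```python
import numpy as np

# Softmax with temperature -> 0: probability mass concentrates on the argmax.
logits = np.array([1.0, 2.0, 0.5])
temperature = 1e-6
scaled = logits / temperature
probs = np.exp(scaled - scaled.max())
probs /= probs.sum()
print(probs)  # ~[0, 1, 0]: sampling becomes equivalent to argmax

# DiagGaussian with std * 0: the sampled action is exactly the mean.
mean = np.array([0.3, -0.7])
std = np.array([0.1, 0.2])
action = mean + std * 0.0 * np.random.randn(2)
print(action)  # deterministic action
```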
Would you have the time to help implement this?
> Would you have the time to help implement this?
Yes, I would like to do this, but since I'm new to Ray and RLlib, some guidance would be greatly appreciated:
Commit 1: generalising the evaluate loop.
We need:
- include an _evaluate() call in the base class train loop, like the one found in https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L280 - ??? I think I need advice here on how to properly implement this, since the base Trainer class does not explicitly override ray.tune.Trainable._train();
- if I see it correctly, we only need to update the local evaluator's config["env_config"] with extra values for env_creator to get the evaluation-specific parameters.
Am I missing something here?
Commit 2: Disable Exploration:
> set DiagGaussian standard deviation to zero, and also softmax temperature to close to zero
Cool, some answers inline:
> include an _evaluate() call in the base class train loop, like the one found in https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L280 - ??? I think I need advice here on how to properly implement this, since the base Trainer class does not explicitly override ray.tune.Trainable._train()
You could instead add it to the Trainer.train() method, which in turn calls Trainable.train(), which calls Trainer._train().
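A rough sketch of that suggestion (not the actual RLlib code; the `_evaluate()` method and the `evaluation_interval` key are assumed to be provided by the concrete trainer):

```python
from ray.tune import Trainable


class Trainer(Trainable):
    """Sketch only: wrap Trainable.train() and interleave periodic evaluation."""

    def train(self):
        # Trainable.train() does the bookkeeping and internally calls self._train().
        result = Trainable.train(self)
        interval = self.config.get("evaluation_interval")
        if interval and result["training_iteration"] % interval == 0:
            # _evaluate() is assumed to run episodes on separate evaluation
            # worker(s) and return metrics under a distinct "evaluation" key.
            result.update(self._evaluate())
        return result
```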
> if I see it correctly, we only need to update the local evaluator's config["env_config"] with extra values for env_creator to get the evaluation-specific parameters.
Sounds good. I would probably add a generic "evaluation_config" to trainer config that can be merged with self.config to override potentially any value, including env_config. You can use the merge_dicts() utility function for that.
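For example (a hedged sketch: the exact import path of merge_dicts has moved between Ray versions, and the env_config keys below are made up):

```python
from ray.tune.utils import merge_dicts  # assumption: import path for this Ray version

base_config = {
    "num_workers": 4,
    "env_config": {"dynamics": "source"},
    # overrides applied only when building the evaluation worker(s)
    "evaluation_config": {"env_config": {"dynamics": "target"}},
}

# Deep-merge the overrides on top of the full trainer config.
eval_config = merge_dicts(base_config, base_config["evaluation_config"])
assert eval_config["env_config"]["dynamics"] == "target"
assert eval_config["num_workers"] == 4  # untouched keys are inherited from the base config
```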
> set DiagGaussian standard deviation to zero, and also softmax temperature to close to zero
For Categorical, you can multiply self.inputs in the line here by a large number to ensure determinism (or use argmax): https://github.com/ray-project/ray/blob/master/python/ray/rllib/models/action_dist.py#L113
For DiagGaussian, the std would need to be multiplied by zero here for deterministic sampling: https://github.com/ray-project/ray/blob/master/python/ray/rllib/models/action_dist.py#L180
Currently the config is not passed all the way through to the action distribution objects themselves; however, the config is already passed in get_action_dist(): https://github.com/ray-project/ray/blob/master/python/ray/rllib/models/catalog.py#L94, so it should be possible to add some code in get_action_dist() to return deterministic class variants instead.
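A hedged sketch of what such "deterministic class variants" could look like, written against the TF1-era action_dist.py linked above (the class and attribute names are assumed from that file; exact paths differ in later RLlib versions):

```python
import tensorflow as tf

# Assumed import path for the RLlib version linked above.
from ray.rllib.models.action_dist import Categorical, DiagGaussian


class DeterministicCategorical(Categorical):
    def sample(self):
        # Greedy action: argmax over the logits instead of multinomial sampling.
        return tf.argmax(self.inputs, axis=1)


class DeterministicDiagGaussian(DiagGaussian):
    def sample(self):
        # Equivalent to multiplying the std by zero: always return the mean.
        return self.mean


# get_action_dist() in catalog.py could then return these variants when an
# (assumed) deterministic/evaluation option is set in the model config.
```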
@ericl, I currently see two possible options to implement the train-evaluate loop:

- modify rllib.Trainer.train() to include the train-eval loop, as discussed before;
- modify the tune.Trainable.train() method so that it calls _train() and, if [eval_condition], _evaluate().

The latter makes periodic evaluation a generic Tune feature (which IMHO is the more logical approach w.r.t. the overall functional hierarchy): evaluation_interval and evaluation_num_episodes would become Tune-level arguments, while extra_evaluation_config (and exploration settings) remains at the RLlib Trainer spec level.

Which one is more appealing? Could there be any pitfalls in the latter approach that I'm unaware of?

I think the former is easier for now, since we have yet to fully define what the evaluation API in Tune should look like. It should be easy to move it onto that later once it's ready.
I have a similar problem where I want to train on train tasks and evaluate (while training) on test tasks using actor-critic methods. The issue is closed, so I assume that @Kismuz was able to implement this. However, I was not able to find any documentation on how to use the evaluation_config: {} option.
So how do we pass in the test tasks? Or, if the evaluation_config: {} option is not ready yet, what do I have to modify to perform evaluation on test tasks?
@alversafa, you should add two keys ("evaluation_interval" and "evaluation_num_episodes") to the trainer config to enable periodic evaluation; you can also add another key, "evaluation_config", containing a dictionary of top-level config keys. Those values override the basic (train) config keys (e.g. env_config) when the evaluators are instantiated. A simple example of such a setup can be found here:
https://github.com/ray-project/ray/blob/master/rllib/tests/test_evaluators.py#L30
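For completeness, a hedged end-to-end sketch of that setup (the NoiseEnv wrapper and its "noise_level" key are made up to stand in for source/target dynamics; the evaluation_* keys are the ones discussed above, and the classic gym step API is assumed):

```python
import gym
import numpy as np
import ray
from ray import tune
from ray.tune.registry import register_env


class NoiseEnv(gym.Wrapper):
    """CartPole with configurable observation noise, standing in for modified dynamics."""

    def __init__(self, env_config):
        super().__init__(gym.make("CartPole-v0"))
        self.noise_level = env_config.get("noise_level", 0.0)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)  # classic 4-tuple gym API
        obs = obs + np.random.normal(0.0, self.noise_level, size=np.shape(obs))
        return obs, reward, done, info


register_env("noise_cartpole", lambda env_config: NoiseEnv(env_config))

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 50},
    config={
        "env": "noise_cartpole",
        "env_config": {"noise_level": 0.0},      # source (train) dynamics
        "evaluation_interval": 10,               # evaluate every 10 train() calls
        "evaluation_num_episodes": 1,
        "evaluation_config": {
            # merged over the config above for the evaluation workers only
            "env_config": {"noise_level": 0.1},  # target (test) dynamics
        },
    },
)
```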
@Kismuz , it has been a while, but thank you, it was helpful.
Now I can evaluate the agents periodically. However, digging through the code, I was not able to locate where the temperature of the softmax policies is set to zero during evaluation. Am I right in assuming that they act greedily during evaluation?
Is it possible for you to point me to the code?
@alversafa ,
> I was not able to locate where the temperature of the softmax policies is set to zero during evaluation.
@Kismuz , I see, thanks!
@ericl, @Kismuz, is there an easy way (without changing much of the code) to achieve this?
Just taking the argmax for the evaluation worker policies will do it; however, I can't locate the place to add it.
@Kismuz, is it possible for you to point out where I should use this?
@Kismuz,
I made it work by creating a deterministic version of the Categorical class in https://github.com/ray-project/ray/blob/3d9bd64591506c2d3cd79d18c96908c996b52c3f/rllib/models/tf/tf_action_dist.py#L41
where I take the argmax (instead of tf.multinomial(.)) in the following line:
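Roughly, such a change could look like the subclass below (a hedged sketch, not alversafa's actual code: it assumes the TFActionDistribution base class at that commit builds its sample op via _build_sample_op(), and the exact line edited is not shown in the thread).

```python
import tensorflow as tf

# Assumed import path matching the linked rllib/models/tf/tf_action_dist.py.
from ray.rllib.models.tf.tf_action_dist import Categorical


class GreedyCategorical(Categorical):
    """Categorical distribution that always picks the most likely action."""

    def _build_sample_op(self):
        # The original Categorical samples via tf.multinomial(self.inputs, 1);
        # taking the argmax over the logits makes evaluation deterministic.
        return tf.argmax(self.inputs, axis=1)
```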
Thanks for all the help.