_Problem description:_
Suppose we can instantiate several environment simulators with predefined dynamics (source, or train tasks) and an instance of an environment with slightly modified dynamics (target, or test task).
The aim is to run policy optimization on episodes from the source domain and periodically check the agent's performance on samples from the target environment, e.g.:
repeat:
1. run policy optimization for ~10 episodes from the source environment instances;
2. run ~1 episode of policy evaluation on the target environment;
It is desirable to run the evaluation task as a separate worker to prevent knowledge leakage (as opposed to a 'just-set-trainer-learning-rate-to-zero' approach).
It is also highly desirable to run the whole experiment from the Tune Python API and log evaluation results to the TensorBoard summaries under an 'evaluate' tag.
_Question_:
Is there any predefined solution for setting up such a workflow? If not, is there a suggested way to implement it?
My search through the docs only turned up the checkpoint/load/evaluate routine of the command-line API:
https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies
[Possibly] related: #2799, #4569 and #4496
The closest thing might be the "evaluation_interval" setting in DQN, which will periodically run episodes with epsilon=0 and log them under a separate evaluation/ metric key.
This is something that could potentially be generalized to other algorithms as well. Do you see some easy way of achieving this? Perhaps we can allow a separate evaluation config to be specified that allows various config settings, including the env, to be overridden?
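For reference, a minimal sketch of what using that existing DQN option looks like (hedged: the config key is the one discussed above, and CartPole-v0 is just a stand-in env):

```python
import ray
from ray import tune

ray.init()
tune.run(
    "DQN",
    stop={"training_iteration": 50},
    config={
        "env": "CartPole-v0",
        # DQN-specific at the time: run greedy (epsilon=0) evaluation episodes
        # every N calls to train() and report them under the evaluation/ metrics.
        "evaluation_interval": 10,
    },
)
```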
"evaluation_interval" setting in DQN, which will periodically run episodes with epsilon=0 and log them under a separate evaluation/ metric key.
Yes indeed, https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L51 is exactly what I meant.
> Perhaps we can allow a separate evaluation config
Yes; if I understand the code at https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L211 correctly, then in addition to the policy_graph extra_config, one could specify an evaluation env_extra_config that [partially] overrides env_config and is passed along to env_creator when making the local evaluator.
It seems (at first glance) that the only change needed to the evaluation policy's behaviour is to disable exploration. In the case of actor-critic architectures, this could be achieved by computing actions deterministically (#4496).
> Yes; if I understand the code at https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L211 correctly, then in addition to the policy_graph extra_config, one could specify an evaluation env_extra_config that [partially] overrides env_config and is passed along to env_creator when making the local evaluator.
Makes sense. I guess this limits you to the same env class, but the env config could be different for evaluation.
> It seems (at first glance) that the only change needed to the evaluation policy's behaviour is to disable exploration. In the case of actor-critic architectures, this could be achieved by computing actions deterministically (#4496).
Yep, we could add a model option to set DiagGaussian standard deviation to zero, and also softmax temperature to close to zero. Then, this could be handled via extra_config as well.
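As a quick illustration of why those two knobs give deterministic behaviour (plain numpy, not RLlib code): driving the softmax temperature toward zero collapses the categorical distribution onto the argmax, and zeroing the std collapses the Gaussian onto its mean.

```python
import numpy as np

# Softmax with temperature -> 0: probability mass concentrates on the argmax.
logits = np.array([1.0, 2.0, 0.5])
temperature = 1e-6
scaled = logits / temperature
probs = np.exp(scaled - scaled.max())
probs /= probs.sum()
print(probs)  # ~[0, 1, 0]: sampling becomes equivalent to argmax

# DiagGaussian with std * 0: the sampled action is exactly the mean.
mean = np.array([0.3, -0.7])
std = np.array([0.1, 0.2])
action = mean + std * 0.0 * np.random.randn(2)
print(action)  # deterministic action
```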
Would you have the time to help implement this?
> Would you have the time to help implement this?
Yes, I would like to do this, but since I'm new to Ray and RLlib, some guidance would be greatly appreciated:
Commit 1: generalising the evaluate loop.
We need:
- include an _evaluate() call in the base class train loop, like the one found in https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L280 - ??? I think I need advice here on how to properly implement this, since the base Trainer class does not explicitly override ray.tune.Trainable._train();
- if I see it correctly, we only need to update the local evaluator's config["env_config"] with extra values for env_creator to get the evaluation-specific parameters.
Am I missing something here?
Commit 2: Disable Exploration:
> set DiagGaussian standard deviation to zero, and also softmax temperature to close to zero
Cool, some answers inline:
> include an _evaluate() call in the base class train loop, like the one found in https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/dqn/dqn.py#L280 - ??? I think I need advice here on how to properly implement this, since the base Trainer class does not explicitly override ray.tune.Trainable._train()
You could instead add it to the Trainer.train() method, which in turn calls Trainable.train(), which calls Trainer._train().
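A rough sketch of that suggestion (not the actual RLlib code; the `_evaluate()` method and the `evaluation_interval` key are assumed to be provided by the concrete trainer):

```python
from ray.tune import Trainable


class Trainer(Trainable):
    """Sketch only: wrap Trainable.train() and interleave periodic evaluation."""

    def train(self):
        # Trainable.train() does the bookkeeping and internally calls self._train().
        result = Trainable.train(self)
        interval = self.config.get("evaluation_interval")
        if interval and result["training_iteration"] % interval == 0:
            # _evaluate() is assumed to run episodes on separate evaluation
            # worker(s) and return metrics under a distinct "evaluation" key.
            result.update(self._evaluate())
        return result
```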
> if I see it correctly, we only need to update the local evaluator's config["env_config"] with extra values for env_creator to get the evaluation-specific parameters.
Sounds good. I would probably add a generic "evaluation_config" to trainer config that can be merged with self.config to override potentially any value, including env_config. You can use the merge_dicts() utility function for that.
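For example (a hedged sketch: the exact import path of merge_dicts has moved between Ray versions, and the env_config keys below are made up):

```python
from ray.tune.utils import merge_dicts  # assumption: import path for this Ray version

base_config = {
    "num_workers": 4,
    "env_config": {"dynamics": "source"},
    # overrides applied only when building the evaluation worker(s)
    "evaluation_config": {"env_config": {"dynamics": "target"}},
}

# Deep-merge the overrides on top of the full trainer config.
eval_config = merge_dicts(base_config, base_config["evaluation_config"])
assert eval_config["env_config"]["dynamics"] == "target"
assert eval_config["num_workers"] == 4  # untouched keys are inherited from the base config
```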
> set DiagGaussian standard deviation to zero, and also softmax temperature to close to zero
For Categorical, you can multiply self.inputs in the line here by a large number to ensure determinism (or use argmax): https://github.com/ray-project/ray/blob/master/python/ray/rllib/models/action_dist.py#L113
For DiagGaussian, the std would need to be multiplied by zero here for deterministic sampling: https://github.com/ray-project/ray/blob/master/python/ray/rllib/models/action_dist.py#L180
Currently the config is not passed all the way through to the action distribution objects themselves; however, the config is already passed in get_action_dist(): https://github.com/ray-project/ray/blob/master/python/ray/rllib/models/catalog.py#L94, so it should be possible to add some code in get_action_dist() to return deterministic class variants instead.
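A hedged sketch of what such "deterministic class variants" could look like, written against the TF1-era action_dist.py linked above (the class and attribute names are assumed from that file; exact paths differ in later RLlib versions):

```python
import tensorflow as tf

# Assumed import path for the RLlib version linked above.
from ray.rllib.models.action_dist import Categorical, DiagGaussian


class DeterministicCategorical(Categorical):
    def sample(self):
        # Greedy action: argmax over the logits instead of multinomial sampling.
        return tf.argmax(self.inputs, axis=1)


class DeterministicDiagGaussian(DiagGaussian):
    def sample(self):
        # Equivalent to multiplying the std by zero: always return the mean.
        return self.mean


# get_action_dist() in catalog.py could then return these variants when an
# (assumed) deterministic/evaluation option is set in the model config.
```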
@ericl, I currently see two possible options to implement the train-evaluate loop:

- modify rllib.Trainer.train() to include the train-eval loop, as discussed before;
- modify the tune.Trainable.train() method so that it calls _train() and, if [eval_condition], _evaluate().

The latter makes periodic evaluation a generic Tune feature (which IMHO is the more logical approach w.r.t. the overall functional hierarchy): evaluation_interval and evaluation_num_episodes would become Tune-level arguments, while extra_evaluation_config (and exploration settings) remains at the RLlib Trainer spec level.

Which one is more appealing? Could there be any pitfalls in the latter approach that I'm unaware of?

I think the former is easier for now, since we have yet to fully define what the evaluation API in Tune should look like. It should be easy to move it onto that later once it's ready.
I have a similar problem where I want to train on train tasks and evaluate (while training) on test tasks using actor-critic methods. The issue is closed, so I assume that @Kismuz was able to implement this. However, I was not able to find any documentation on how to use the evaluation_config: {} option.
So how do we pass in the test tasks? Or, if the evaluation_config: {} option is not ready yet, what do I have to modify to perform evaluation on test tasks?
@alversafa, you should add two keys ("evaluation_interval" and "evaluation_num_episodes") to the trainer config to enable periodic evaluation; you can also add another key, "evaluation_config", containing a dictionary of top-level config keys. Those values override the basic (train) config keys (e.g. env_config) when the evaluators are instantiated. A simple example of such a setup can be found here:
https://github.com/ray-project/ray/blob/master/rllib/tests/test_evaluators.py#L30
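For completeness, a hedged end-to-end sketch of that setup (the NoiseEnv wrapper and its "noise_level" key are made up to stand in for source/target dynamics; the evaluation_* keys are the ones discussed above, and the classic gym step API is assumed):

```python
import gym
import numpy as np
import ray
from ray import tune
from ray.tune.registry import register_env


class NoiseEnv(gym.Wrapper):
    """CartPole with configurable observation noise, standing in for modified dynamics."""

    def __init__(self, env_config):
        super().__init__(gym.make("CartPole-v0"))
        self.noise_level = env_config.get("noise_level", 0.0)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)  # classic 4-tuple gym API
        obs = obs + np.random.normal(0.0, self.noise_level, size=np.shape(obs))
        return obs, reward, done, info


register_env("noise_cartpole", lambda env_config: NoiseEnv(env_config))

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 50},
    config={
        "env": "noise_cartpole",
        "env_config": {"noise_level": 0.0},      # source (train) dynamics
        "evaluation_interval": 10,               # evaluate every 10 train() calls
        "evaluation_num_episodes": 1,
        "evaluation_config": {
            # merged over the config above for the evaluation workers only
            "env_config": {"noise_level": 0.1},  # target (test) dynamics
        },
    },
)
```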
@Kismuz , it has been a while, but thank you, it was helpful.
Now I can evaluate the agents periodically. However, digging through the code, I was not able to locate where the temperature of the softmax policies is set to zero during evaluation. Am I right in assuming that they act greedily during evaluation?
Is it possible for you to point me to the code?
@alversafa ,
> I was not able to locate where the temperature of the softmax policies is set to zero during evaluation.
@Kismuz , I see, thanks!
@ericl, @Kismuz, is there an easy way (without changing much of the code) to achieve this?
Just taking the argmax for the evaluation worker policies will do it; however, I can't locate the place to add it.
@Kismuz, is it possible for you to point out where I should use this?
@Kismuz,
I made it work by creating a deterministic version of the Categorical class in https://github.com/ray-project/ray/blob/3d9bd64591506c2d3cd79d18c96908c996b52c3f/rllib/models/tf/tf_action_dist.py#L41
where I take the argmax (instead of tf.multinomial(.)) in the following line:
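Roughly, such a change could look like the subclass below (a hedged sketch, not alversafa's actual code: it assumes the TFActionDistribution base class at that commit builds its sample op via _build_sample_op(), and the exact line edited is not shown in the thread).

```python
import tensorflow as tf

# Assumed import path matching the linked rllib/models/tf/tf_action_dist.py.
from ray.rllib.models.tf.tf_action_dist import Categorical


class GreedyCategorical(Categorical):
    """Categorical distribution that always picks the most likely action."""

    def _build_sample_op(self):
        # The original Categorical samples via tf.multinomial(self.inputs, 1);
        # taking the argmax over the logits makes evaluation deterministic.
        return tf.argmax(self.inputs, axis=1)
```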
Thanks for all the help.