Hi,
I would like to propose adding observation normalization to the current DDPG implementation (like here). From the docs I see that we need to use MeanStdFilter.
So, what are the best places to add this feature to the DDPG implementation with a correct save/load mechanism?
Thanks!
PS. Are you planning to add a D4PG implementation to Ray (because that's what I really want)?
Hm, I think the only change you need to make is to call FilterManager.synchronize() every once in a while in the train() method of DQNAgent (A3C example: https://github.com/ray-project/ray/blob/d01dc9e22d5e8625ae6ac49e2e689eebf472b5f8/python/ray/rllib/agents/a3c/a3c.py#L102). Otherwise the workers' filters could drift apart from each other. Note that synchronizing the filters acts as a global barrier, so you shouldn't call it more than once per iteration or so (for apex, that's probably every 30s).
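A minimal sketch of what that could look like, assuming the DDPG agent exposes local_evaluator and remote_evaluators the same way the linked A3C agent does (these attribute names are my assumption, not checked against the DDPG agent code):

from ray.rllib.utils.filter_manager import FilterManager


def sync_observation_filters(local_evaluator, remote_evaluators):
    """Synchronize the observation filters between the local worker and the
    remote workers so their running mean/std estimates don't drift apart.
    Intended to be called once per train() iteration (sketch only)."""
    FilterManager.synchronize(local_evaluator.filters, remote_evaluators)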
You can then enable the observation filter by setting "observation_filter": "MeanStdFilter" in the config. Saving and restoring should work out of the box, since DDPG uses the common policy evaluator class.
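A hedged end-to-end example of the config side; the DDPGAgent import and constructor reflect my understanding of the RLlib API at the time, and "Pendulum-v0" is just a placeholder environment:

import ray
from ray.rllib.agents.ddpg import DDPGAgent  # assumed import path

ray.init()

agent = DDPGAgent(
    env="Pendulum-v0",  # placeholder continuous-control env
    config={
        # Normalize observations with a running mean/std filter.
        "observation_filter": "MeanStdFilter",
    })

agent.train()

# Checkpointing should carry the filter state along with the weights.
checkpoint_path = agent.save()
agent.restore(checkpoint_path)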
Re: distributional DDPG, no current plans.
@ericl, a few more off-topic questions about the Ray DDPG implementation (meanwhile, distributional DDPG is in progress).
I see that you have reset_noise_op to reset the OU noise at the beginning of each sampled episode; nevertheless, the current DDPG implementation uses Normal noise instead of OU noise and never uses reset_noise_op.
I want to add OU noise support and Parameter Noise to the current DDPG implementation, but... I would need to rewrite the sampler logic somewhere here, if I understand everything correctly.
So I wonder: what is the best way to combine a policy_graph update (weights and noise) with the sampler's end of episode? Something like an end_of_episode_callback that can modify the sampler's internals?
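For context, here is a minimal discrete-time Ornstein-Uhlenbeck process of the kind the DDPG paper uses for exploration; the theta and sigma defaults are the commonly used values, not values taken from the RLlib code:

import numpy as np


class OrnsteinUhlenbeckNoise(object):
    """Temporally correlated exploration noise: dx = theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        # Re-center the noise state; this is what a reset-at-episode-start
        # hook (like reset_noise_op) would trigger.
        self.state = np.copy(self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) \
            + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state


# Usage: add noise.sample() to the deterministic policy action and call
# noise.reset() at the start of every episode.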
Cool.
That makes sense; a method on the policy graph that is called by the sampler on episode end? It wouldn't quite work if num_envs_per_worker > 1, but we can probably just ignore that case for now.
@ericl
Can you please ping me as soon as this method is available, or add any helpful comments on its implementation? I want to add these features before the end of this week and test all of them during the next one.
Can you try out something like this?
diff --git a/python/ray/rllib/evaluation/policy_graph.py b/python/ray/rllib/evaluation/policy_graph.py
index 32534d7..49cc0ad 100644
--- a/python/ray/rllib/evaluation/policy_graph.py
+++ b/python/ray/rllib/evaluation/policy_graph.py
@@ -89,6 +89,9 @@ class PolicyGraph(object):
         """
         return sample_batch
+    def on_episode_end(self):
+        pass
+
     def compute_gradients(self, postprocessed_batch):
         """Computes gradients against a batch of experiences.
diff --git a/python/ray/rllib/evaluation/sampler.py b/python/ray/rllib/evaluation/sampler.py
index 6ae66e6..c0cae48 100644
--- a/python/ray/rllib/evaluation/sampler.py
+++ b/python/ray/rllib/evaluation/sampler.py
@@ -278,6 +278,8 @@ def _env_runner(async_vector_env,
             if all_done:
                 # Handle episode termination
+                for policy in policies.values():
+                    policy.on_episode_end()
                 batch_builder_pool.append(episode.batch_builder)
                 del active_episodes[env_id]
                 resetted_obs = async_vector_env.try_reset(env_id)
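With a hook like that in place, the DDPG policy graph could reset its exploration noise at episode boundaries, roughly like this (a sketch only; self.sess and self.reset_noise_op are assumed to already exist on the policy graph, the latter being the op linked above):

class NoiseResetMixin(object):
    """Sketch of a policy-graph mixin that uses the proposed hook to
    re-initialize the exploration noise whenever an episode finishes."""

    def on_episode_end(self):
        # Assumes the policy graph defines self.sess (a tf.Session) and
        # self.reset_noise_op (an op that re-initializes the noise state).
        self.sess.run(self.reset_noise_op)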
@ericl
But how can the sampler send a sample_batch of observations to the policy? Because, in the end, I need something like this.
And even if I can use self.sess, I can't find the observations in the policy graph.
PS. I hope this will be the last challenge for the param noise implementation.
If you just need to look at the batches to compute statistics, you can do that in the postprocess_trajectory() method of the policy graph (if needed, you can buffer the batches in memory until on_episode_end() is called, though that would use more memory).
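A rough sketch of that buffering idea (the postprocess_trajectory signature and the "obs" batch key follow my reading of the PolicyGraph interface discussed above and may need adjusting):

import numpy as np


class EpisodeObsBufferMixin(object):
    """Sketch: buffer trajectory observations so per-episode statistics
    can be computed once on_episode_end() fires."""

    def postprocess_trajectory(self, sample_batch, other_agent_batches=None):
        # Called by the sampler with the latest trajectory fragment.
        if not hasattr(self, "_episode_obs"):
            self._episode_obs = []
        self._episode_obs.append(sample_batch["obs"])
        return sample_batch

    def on_episode_end(self):
        if not getattr(self, "_episode_obs", None):
            return
        obs = np.concatenate(self._episode_obs, axis=0)
        self._episode_obs = []
        # Example statistics over the finished episode; a param-noise
        # implementation would compute its action-space distance here instead.
        self._obs_mean = obs.mean(axis=0)
        self._obs_std = obs.std(axis=0)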
postprocess_trajectory looks good, but as far as I understand it contains only the samples from one full episode, while param noise uses uncorrelated samples from the whole buffer for better statistics.
You can collect statistics over time, right? They won't be synchronized across workers, though; if that matters, we can add a synchronization mechanism similar to the one used by filters.
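For what it's worth, the adaptation rule from the parameter space noise paper (arXiv:1706.01905) only needs a scalar distance estimate, so per-worker, unsynchronized statistics may already be workable; the rule itself is just:

def adapt_param_noise_stddev(stddev, distance, target_distance, alpha=1.01):
    """Adaptive scaling from Plappert et al. (2017): shrink the parameter
    perturbation scale when the perturbed and unperturbed policies drift
    too far apart in action space, and grow it otherwise."""
    if distance > target_distance:
        return stddev / alpha
    return stddev * alpha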
@Scitator Could you please let me know if the parameter noise feature is ready? I think this feature is crucial for some training tasks. Thank you!
@RodgerLuo Sorry, but because of competition needs I had to move to my personal RL implementations, and, again because of the competitions, I can't share them for now.
Nevertheless, our param noise implementation from last year is quite good: github.com/fgvbrt/nips_rl/blob/farm/pyro_farm/sampler.py#L15
@Scitator thanks for letting me know and sharing the previous link!