Hi,
I would like to propose adding observation normalization to the current DDPG implementation (like here). From the docs I see that we need to use MeanStdFilter.
So, what are the best places to add this feature to the DDPG implementation with a correct save/load mechanism?
Thanks!
PS. Are you planning to add a D4PG implementation to Ray (because that's what I really want)?
Hm, I think the only change you need to make is to call FilterManager.synchronize() every once in a while in the train() method of DQNAgent (A3C example: https://github.com/ray-project/ray/blob/d01dc9e22d5e8625ae6ac49e2e689eebf472b5f8/python/ray/rllib/agents/a3c/a3c.py#L102). Otherwise the workers' filters could drift apart from each other. Note that synchronizing the filters acts as a global barrier, so you shouldn't call it more than once per iteration or so (for apex, that's probably every 30s).
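A minimal sketch of what that could look like, assuming the DDPG agent exposes local_evaluator and remote_evaluators the same way the linked A3C agent does (these attribute names are my assumption, not checked against the DDPG agent code):

from ray.rllib.utils.filter_manager import FilterManager


def sync_observation_filters(local_evaluator, remote_evaluators):
    """Synchronize the observation filters between the local worker and the
    remote workers so their running mean/std estimates don't drift apart.
    Intended to be called once per train() iteration (sketch only)."""
    FilterManager.synchronize(local_evaluator.filters, remote_evaluators)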
You can then enable the observation filter by setting "observation_filter": "MeanStdFilter" in the config. Saving and restoring should work out of the box, since DDPG uses the common policy evaluator class.
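A hedged end-to-end example of the config side; the DDPGAgent import and constructor reflect my understanding of the RLlib API at the time, and "Pendulum-v0" is just a placeholder environment:

import ray
from ray.rllib.agents.ddpg import DDPGAgent  # assumed import path

ray.init()

agent = DDPGAgent(
    env="Pendulum-v0",  # placeholder continuous-control env
    config={
        # Normalize observations with a running mean/std filter.
        "observation_filter": "MeanStdFilter",
    })

agent.train()

# Checkpointing should carry the filter state along with the weights.
checkpoint_path = agent.save()
agent.restore(checkpoint_path)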
Re: distributional DDPG, no current plans.
@ericl, a few more off-topic questions about the Ray DDPG implementation (meanwhile, distributional DDPG is in progress).
I see that you have reset_noise_op to reset the OU noise at the beginning of each sampled episode; nevertheless, the current DDPG implementation uses Normal noise instead of OU noise and never uses reset_noise_op.
I want to add OU noise support and Parameter Noise to the current DDPG implementation, but... I would need to rewrite the sampler logic somewhere here, if I understand everything correctly.
So I wonder: what is the best way to combine a policy_graph update (weights and noise) with the sampler's end of episode? Something like an end_of_episode_callback that can modify the sampler's internals?
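For context, here is a minimal discrete-time Ornstein-Uhlenbeck process of the kind the DDPG paper uses for exploration; the theta and sigma defaults are the commonly used values, not values taken from the RLlib code:

import numpy as np


class OrnsteinUhlenbeckNoise(object):
    """Temporally correlated exploration noise: dx = theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        # Re-center the noise state; this is what a reset-at-episode-start
        # hook (like reset_noise_op) would trigger.
        self.state = np.copy(self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) \
            + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state


# Usage: add noise.sample() to the deterministic policy action and call
# noise.reset() at the start of every episode.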
Cool.
That makes sense; a method on the policy graph that is called by the sampler on episode end? It wouldn't quite work if num_envs_per_worker > 1, but we can probably just ignore that case for now.
@ericl
Can you please ping me as soon as this method is available, or add any helpful comments on its implementation? I want to add these features before the end of this week and test all of them during the next one.
Can you try out something like this?
diff --git a/python/ray/rllib/evaluation/policy_graph.py b/python/ray/rllib/evaluation/policy_graph.py
index 32534d7..49cc0ad 100644
--- a/python/ray/rllib/evaluation/policy_graph.py
+++ b/python/ray/rllib/evaluation/policy_graph.py
@@ -89,6 +89,9 @@ class PolicyGraph(object):
         """
         return sample_batch
+    def on_episode_end(self):
+        pass
+
     def compute_gradients(self, postprocessed_batch):
         """Computes gradients against a batch of experiences.
diff --git a/python/ray/rllib/evaluation/sampler.py b/python/ray/rllib/evaluation/sampler.py
index 6ae66e6..c0cae48 100644
--- a/python/ray/rllib/evaluation/sampler.py
+++ b/python/ray/rllib/evaluation/sampler.py
@@ -278,6 +278,8 @@ def _env_runner(async_vector_env,
             if all_done:
                 # Handle episode termination
+                for policy in policies.values():
+                    policy.on_episode_end()
                 batch_builder_pool.append(episode.batch_builder)
                 del active_episodes[env_id]
                 resetted_obs = async_vector_env.try_reset(env_id)
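With a hook like that in place, the DDPG policy graph could reset its exploration noise at episode boundaries, roughly like this (a sketch only; self.sess and self.reset_noise_op are assumed to already exist on the policy graph, the latter being the op linked above):

class NoiseResetMixin(object):
    """Sketch of a policy-graph mixin that uses the proposed hook to
    re-initialize the exploration noise whenever an episode finishes."""

    def on_episode_end(self):
        # Assumes the policy graph defines self.sess (a tf.Session) and
        # self.reset_noise_op (an op that re-initializes the noise state).
        self.sess.run(self.reset_noise_op)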
@ericl
But how can the sampler send a sample_batch of observations to the policy? Because, in the end, I need something like this.
And even if I can use self.sess, I can't find the observations in the policy graph.
PS. I hope this will be the last challenge for the param noise implementation.
If you just need to look at the batches to compute statistics, you can do that in the postprocess_trajectory() method of the policy graph (if needed, you can buffer the batches in memory until on_episode_end() is called, though that would use more memory).
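A rough sketch of that buffering idea (the postprocess_trajectory signature and the "obs" batch key follow my reading of the PolicyGraph interface discussed above and may need adjusting):

import numpy as np


class EpisodeObsBufferMixin(object):
    """Sketch: buffer trajectory observations so per-episode statistics
    can be computed once on_episode_end() fires."""

    def postprocess_trajectory(self, sample_batch, other_agent_batches=None):
        # Called by the sampler with the latest trajectory fragment.
        if not hasattr(self, "_episode_obs"):
            self._episode_obs = []
        self._episode_obs.append(sample_batch["obs"])
        return sample_batch

    def on_episode_end(self):
        if not getattr(self, "_episode_obs", None):
            return
        obs = np.concatenate(self._episode_obs, axis=0)
        self._episode_obs = []
        # Example statistics over the finished episode; a param-noise
        # implementation would compute its action-space distance here instead.
        self._obs_mean = obs.mean(axis=0)
        self._obs_std = obs.std(axis=0)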
postprocess_trajectory looks good, but as far as I understand it contains only the samples from one full episode, while param noise uses uncorrelated samples from the whole buffer for better statistics.
You can collect statistics over time, right? They won't be synchronized across workers, though; if that matters, we can add a synchronization mechanism similar to the one used by filters.
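For what it's worth, the adaptation rule from the parameter space noise paper (arXiv:1706.01905) only needs a scalar distance estimate, so per-worker, unsynchronized statistics may already be workable; the rule itself is just:

def adapt_param_noise_stddev(stddev, distance, target_distance, alpha=1.01):
    """Adaptive scaling from Plappert et al. (2017): shrink the parameter
    perturbation scale when the perturbed and unperturbed policies drift
    too far apart in action space, and grow it otherwise."""
    if distance > target_distance:
        return stddev / alpha
    return stddev * alpha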
@Scitator Could you please let me know if the parameter noise feature is ready? I think this feature is crucial for some training tasks. Thank you!
@RodgerLuo Sorry, but because of competition needs I had to move to my personal RL implementations, and, again because of the competitions, I can't share them for now.
Nevertheless, our param noise implementation from last year is quite good: github.com/fgvbrt/nips_rl/blob/farm/pyro_farm/sampler.py#L15
@Scitator thanks for letting me know and sharing the previous link!