The Deep Deterministic Policy Gradient (DDPG) code from Stable Baselines is presented here.
The actor-critic networks are created as follows:
normalized_obs = tf.clip_by_value(normalize(self.policy_tf.processed_obs, self.obs_rms),
                                  self.observation_range[0], self.observation_range[1])
# Inputs.
self.actions = tf.placeholder(tf.float32, shape=(None,) + self.action_space.shape, name='actions')
# Create networks and core TF parts that are shared across setup parts.
with tf.variable_scope("model", reuse=False):
    self.actor_tf = self.policy_tf.make_actor(normalized_obs)
    self.normalized_critic_tf = self.policy_tf.make_critic(normalized_obs, self.actions)
    self.normalized_critic_with_actor_tf = self.policy_tf.make_critic(normalized_obs,
                                                                      self.actor_tf,
                                                                      reuse=True)
The following image gives a general depiction of actor-critic networks.
I understand that:
- actor_tf takes as input the normalised observations and goals.
What I don't understand:
- critic_tf takes as input the normalised observations and goals as well as the set of actions.
- self.actor_tf is used to create self.normalized_critic_with_actor_tf.
My questions therefore are:
What does self.normalized_critic_with_actor_tf represent in the diagram above and why is it used?
Why are actions being passed to the critic input, instead of the reward as depicted in the diagram above? As defined by OpenAI docs, the observation does not contain the reward.
Hello,
That's a good question.
First, you would be better off looking at the SAC or TD3 code, because DDPG is not the cleanest (it is part of the legacy code from baselines). Or even better, you can read the Spinning Up guide.
If you want a more "classic" actor-critic architecture, you can take a look at A2C (Paper) (PPO, ACKTR and TRPO follow a similar architecture).
In that case, the critic is the value function V(s) and is used to compute the advantage: A = R - V(s), where R is the return (the sum of discounted rewards).
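To make the advantage formula concrete, here is a minimal sketch in plain Python. It assumes a full Monte Carlo return with no bootstrapping, which is a simplification of what A2C-style implementations do in practice. The function names, the `gamma` value, and the critic predictions in `values` are all hypothetical and chosen only for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma=0.99):
    """Compute A_t = R_t - V(s_t), where values[t] is the critic's estimate V(s_t)."""
    returns = discounted_returns(rewards, gamma)
    return [ret - value for ret, value in zip(returns, values)]


rewards = [1.0, 0.0, 2.0]  # rewards collected during one short episode
values = [2.0, 1.5, 1.0]   # hypothetical critic predictions V(s_t)
print(advantages(rewards, values, gamma=0.5))  # [-0.5, -0.5, 1.0]
```

A positive advantage means the action turned out better than the critic expected, so the policy gradient makes it more likely; a negative one makes it less likely.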
The case of DDPG (and TD3) is a bit special because it uses a deterministic policy (cf. the deterministic policy gradient paper). The policy \pi (usually denoted \mu when deterministic) is not represented by a probability distribution (e.g. a Gaussian distribution in the continuous case for A2C), so you cannot compute the log-likelihood of taking an action (log_prob in the code), which is normally used for the policy gradient.
DDPG has two components: the actor, which is the deterministic policy \pi, and the critic, which is the action-value function Q(s, a). You update the actor \pi by computing the gradient of Q(s, \pi(s)) with respect to the actor's parameters. The idea is that the policy can be seen as a continuous equivalent of argmax, so you update it so that it takes the action that maximizes the Q-function in a given state.
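The actor update described above can be sketched with a toy example in plain Python. This is not the Stable Baselines implementation: the critic here is a hypothetical, fixed quadratic function with a hand-written gradient, and the actor is a single weight, whereas in DDPG both are neural networks, the critic is learned, and the gradient comes from automatic differentiation. All names and hyperparameters are assumptions for illustration.

```python
def critic(state, action):
    """Hypothetical, fixed Q(s, a); for each state its maximum is at a = 2 * s."""
    return -(action - 2.0 * state) ** 2


def critic_grad_action(state, action):
    """dQ/da for the toy critic above."""
    return -2.0 * (action - 2.0 * state)


def actor(weight, state):
    """Deterministic policy pi(s) = weight * s."""
    return weight * state


def actor_update(weight, states, learning_rate=0.1):
    """One gradient-ascent step on the mean of Q(s, pi(s)) over a batch of states."""
    grad = 0.0
    for state in states:
        action = actor(weight, state)
        # Chain rule: dQ/dweight = dQ/da * da/dweight, and da/dweight = state.
        grad += critic_grad_action(state, action) * state
    grad /= len(states)
    return weight + learning_rate * grad


weight = 0.0
states = [-1.0, 0.5, 1.0]
for _ in range(200):
    weight = actor_update(weight, states)
print(round(weight, 3))  # approaches 2.0, the weight whose actions maximize Q
```

The actor never sees a reward directly here: it only follows the critic's gradient towards actions with a higher Q-value, which is the "continuous argmax" intuition.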
Back to the code: normalized_obs is just a preprocessing step. Normally this is done using a wrapper (VecNormalize, for instance), but DDPG implements it in TensorFlow. For simplicity, you can assume normalized_obs = observation (which is true when normalization is deactivated).
So critic_with_actor_tf represents Q(s, \pi(s)), the state-action value in a state s (here observation = state) when following the policy \pi (the actor), i.e. a = \pi(s). This is what is used to compute the gradient for the actor:
https://github.com/hill-a/stable-baselines/blob/a1ab7a1c2903e7e1c38756d8cdf7a54a5fd5781e/stable_baselines/ddpg/ddpg.py#L496
If you look at TD3, you find the same kind of update here:
https://github.com/hill-a/stable-baselines/blob/a1ab7a1c2903e7e1c38756d8cdf7a54a5fd5781e/stable_baselines/td3/td3.py#L193
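As a rough illustration of why the critic is built twice with reuse=True, the sketch below evaluates one critic (one set of weights) on two different action inputs: actions stored in a replay buffer, analogous to normalized_critic_tf, and actions produced by the actor, analogous to normalized_critic_with_actor_tf. The linear critic, the actor gain, and the numbers are hypothetical, and the negative mean of Q(s, \pi(s)) is shown as a common formulation of the actor loss rather than a copy of the linked code.

```python
class LinearCritic:
    """Toy Q(s, a) = w_state * s + w_action * a with a single set of weights."""

    def __init__(self, w_state, w_action):
        self.w_state = w_state
        self.w_action = w_action

    def __call__(self, state, action):
        return self.w_state * state + self.w_action * action


def actor(state, gain=0.5):
    """Toy deterministic policy pi(s) = gain * s."""
    return gain * state


critic = LinearCritic(w_state=1.0, w_action=2.0)
states = [1.0, 2.0]
buffer_actions = [0.0, 1.0]  # actions that were actually taken in the environment

# Same weights, actions from the buffer: used when training the critic.
q_buffer = [critic(s, a) for s, a in zip(states, buffer_actions)]

# Same weights, actions from the actor: used when training the actor.
q_actor = [critic(s, actor(s)) for s in states]

# Minimizing this loss maximizes the mean of Q(s, pi(s)).
actor_loss = -sum(q_actor) / len(q_actor)

print(q_buffer)    # [1.0, 4.0]
print(q_actor)     # [2.0, 4.0]
print(actor_loss)  # -3.0
```

Because both evaluations share weights, anything the critic learns from buffer actions immediately changes the signal the actor receives.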
I hope things are clearer now (DDPG is not the easiest one to understand).
Note that the code is slightly more complicated because it also has target networks, which are there to improve the stability of the algorithm.
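Target networks are typically slow-moving copies of the actor and critic that are nudged towards the online networks after each update (a "soft" or Polyak update). A minimal sketch, assuming parameters are stored as flat lists of floats; the function name and the `tau` values are illustrative assumptions, not taken from the linked code:

```python
def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction tau of the way towards the online one."""
    return [(1.0 - tau) * target + tau * online
            for target, online in zip(target_params, online_params)]


online = [1.0, -2.0]  # parameters of the network being trained
target = [0.0, 0.0]   # parameters of its slow-moving copy

# A large tau is used here only to make the effect visible.
target = soft_update(target, online, tau=0.5)
print(target)  # [0.5, -1.0]
```

With a small tau the targets change slowly, so the value the critic is regressing towards does not shift abruptly from one update to the next.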
Hello,
That is a great answer, and has made things clearer, thank you.
As a follow-up question, I'm trying to understand how, in the code, the critic uses the reward from the environment to determine the accuracy of its value prediction (i.e. through the error, where the error is the difference between the new estimated value of the previous state from the critic network)
I also seem to be fundamentally missing something important from the DDPG code, which is how the actor-critic networks interact with the rollout algorithm (maybe I should open a separate issue for this?)
As I understand it, rollout algorithms are used to improve upon the base policy, i.e. policy improvement. This seems very similar to what is essentially the actor's job. Therefore my question is:
Is the policy improvement step being carried out by the rollout algorithm? I don't think this is the case since we have a separate actor network, but I can't wrap my head around the use of rollouts (to improve the policy?) when we have a separate actor network!
Thanks in advance, this thread was very helpful to understand the code provided.
"maybe I should open a separate issue for this?"
Please read the Spinning Up guide carefully first, and then, if you still have questions, please open another issue.