The tf loss is defined as:
self.loss = -tf.log(self.picked_action_prob) * self.target
where self.target is the advantage estimate.
But as I understand it, in policy gradient methods that expression gives the actual gradient of the parameters, not some loss. Shouldn't the update step just be theta <- theta + alpha * grad(log(pi(s,a))) * A(s,a)?
So why are we minimizing a loss? Is it just a trick, since grad = 0 is a minimum point? Why use TF at all?
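(For concreteness, a minimal TF1-style sketch of the kind of setup I mean; the layer sizes and variable names are illustrative, not the repo's exact code:)

```python
import tensorflow as tf  # TF 1.x style, matching the snippet above

state = tf.placeholder(tf.float32, [1, 4], name="state")   # toy 4-dim state input
action = tf.placeholder(tf.int32, name="action")           # action actually taken
target = tf.placeholder(tf.float32, name="target")         # advantage estimate

logits = tf.layers.dense(state, 2)                         # toy 2-action policy head
action_probs = tf.squeeze(tf.nn.softmax(logits))
picked_action_prob = tf.gather(action_probs, action)

# The "loss" being minimized -- my question is why we minimize this
# instead of applying the gradient ascent step to theta directly.
loss = -tf.log(picked_action_prob) * target
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)
```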
The goal of a loss function is to find the minimum loss; policy gradient methods ensure that each iteration step moves in the right direction (decreasing the loss). If you only look for gradient = 0, you could end up at a local or global maximum instead.
Thanks for the answer, but that's not exactly what I asked.
I think I got it now, but I'd like a confirmation:
For a table lookup (like the first exercise), minimizing a loss is unnecessary; you could just analytically derive an explicit expression for:
grad(log(pi(s,a))) = f(s,a)
then iterate the policy weights (theta) directly, the way David Silver explains in lecture 7 (the update formula from my original question).
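(A rough NumPy sketch of what I mean by iterating theta directly, assuming a tabular softmax policy; in that case the score function is grad(log(pi(s,a))) = one_hot(a) - pi(.|s). The sizes and step size here are just examples.)

```python
import numpy as np

n_states, n_actions, alpha = 16, 4, 0.1
theta = np.zeros((n_states, n_actions))      # one preference per (state, action) pair

def pi(s):
    # Softmax policy over the action preferences for state s
    prefs = theta[s]
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def update(s, a, advantage):
    # Analytical score function for a tabular softmax policy:
    # grad_theta log pi(a|s) = one_hot(a) - pi(.|s)   (only row s changes)
    grad_log_pi = -pi(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha * advantage * grad_log_pi  # plain gradient ascent step
```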
However, for a deeper and more complicated NN (with a general state input, not one-hot) this gets messy: you would essentially have to derive and compute the full backprop yourself. So it's simpler to formulate a "pseudo-loss" function and let TensorFlow calculate the gradient automatically, which is exactly what the code above does.
In addition, TensorFlow applies the iterative step to all the weights, using whatever update algorithm you want (not just theta += alpha * delta).
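(For example, in a TF1-style sketch like the one in my question, swapping the update rule only means swapping the optimizer; the tiny stand-in policy below is just to keep the snippet self-contained.)

```python
import tensorflow as tf  # TF 1.x style

# Tiny stand-in policy (1 state, 2 actions) so the snippet runs on its own.
logits = tf.get_variable("logits", shape=[2], initializer=tf.zeros_initializer())
action = tf.placeholder(tf.int32, name="action")
target = tf.placeholder(tf.float32, name="target")
picked_action_prob = tf.gather(tf.nn.softmax(logits), action)

loss = -tf.log(picked_action_prob) * target

# TF computes d(loss)/d(theta) for every weight and applies the step;
# the update rule is whichever optimizer you pick, not just theta += alpha * delta.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
# train_op = tf.train.AdamOptimizer(0.01).minimize(loss)      # adaptive per-weight steps
# train_op = tf.train.RMSPropOptimizer(0.01).minimize(loss)
```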
So the loss value itself doesn't have a special meaning; what matters is its gradient, which is applied implicitly by the optimizer.
Correct?
@ArikVoronov your intuition is correct. This lecture on policy gradients discusses this issue on slide 28.