Reinforcement-learning: Policy Gradient Methods: Loss function of policy estimator in REINFORCE

Created on 22 Oct 2018 · 3 comments · Source: dennybritz/reinforcement-learning

The TF loss is defined as:

```python
self.loss = -tf.log(self.picked_action_prob) * self.target
```

where `self.target` is the advantage function.
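For context, here is roughly how that loss sits inside the policy estimator (a paraphrased TF 1.x sketch; the names follow the repo, but the surrounding layer code and `self.learning_rate` are illustrative):

```python
# Paraphrased TF 1.x sketch; self.output_layer (logits), self.action, and
# self.target (advantage) are assumed to be defined elsewhere in the estimator.
self.action_probs = tf.squeeze(tf.nn.softmax(self.output_layer))
self.picked_action_prob = tf.gather(self.action_probs, self.action)

# The "loss" in question: minimizing it moves theta along
# +alpha * grad(log pi(a|s)) * target, i.e. gradient ascent on the objective.
self.loss = -tf.log(self.picked_action_prob) * self.target
self.train_op = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)
```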

But as I understand it, in policy gradient methods that expression gives the actual gradient of the parameters, not some loss. Shouldn't the update step just be the following?

$$\Delta\theta_t = \alpha \, \nabla_\theta \log \pi_\theta(s_t, a_t) \, v_t$$

So why are we minimizing the loss? Is it just a trick, since the gradient is zero at a minimum? Why use TF at all?

All 3 comments

The goal of a loss function is to find the minimum loss; policy gradient methods ensure that each iteration step moves the right way (decreasing the loss). If you only looked for gradient = 0, you could land on a local or global maximum instead.

Thanks for the answer, but that's not exactly what I asked.

I think I got it now, but I'd like a confirmation:

For a table lookup (like the first exercise), minimizing the loss is unnecessary; you could just analytically derive an explicit expression for

$$\nabla_\theta \log \pi_\theta(s, a) = f(s, a)$$

then update the policy weights (theta) directly, the way David Silver explains in lecture 7 (the update formula in my original question), as in the sketch below.
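Something like this (a minimal NumPy sketch for a tabular softmax policy; the helper names and the learning rate `alpha` are illustrative, not from the repo):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Tabular softmax policy: theta[s, a] holds logits, so pi(a|s) = softmax(theta[s]).
# For this parameterization the score function has a closed form:
#     grad_{theta[s]} log pi(a|s) = onehot(a) - pi(.|s)
# so the REINFORCE update can be written out by hand, with no autodiff.
def reinforce_step(theta, s, a, v_t, alpha=0.1):
    pi_s = softmax(theta[s])
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0                  # onehot(a) - pi(.|s)
    theta[s] += alpha * v_t * grad_log_pi  # theta <- theta + alpha * v_t * grad
    return theta
```

Calling this once per sampled (s, a, v_t) reproduces the tabular update from the lecture.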

However, for a deeper and more complicated NN (with a general state input, not one-hot) this gets messy: you would essentially have to derive and compute the full backprop yourself. So it's simpler to formulate a "pseudo-loss" function and let TensorFlow calculate the gradient automatically, which is exactly

$$\nabla_\theta \log \pi_\theta(s, a) \, v_t$$
In addition, TensorFlow performs the update step for all weights, using whatever optimizer you want (not just $\theta \leftarrow \theta + \alpha\,\Delta$).
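A quick sanity check of that equivalence (written in TF 2.x eager style for brevity, while the repo itself uses TF 1.x graphs; the dimensions and numbers are illustrative):

```python
import tensorflow as tf

# The autodiff gradient of the pseudo-loss -log(pi(a|s)) * v_t should match
# the hand-derived REINFORCE gradient -v_t * (onehot(a) - pi(.|s)).
theta = tf.Variable(tf.zeros([4]))  # logits for one state, 4 actions
action, v_t = 2, 1.5                # sampled action and return

with tf.GradientTape() as tape:
    probs = tf.nn.softmax(theta)
    pseudo_loss = -tf.math.log(probs[action]) * v_t

auto_grad = tape.gradient(pseudo_loss, theta)
manual_grad = -v_t * (tf.one_hot(action, 4) - tf.nn.softmax(theta))
print(auto_grad.numpy())    # same vector...
print(manual_grad.numpy())  # ...up to floating-point error
```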

So the loss value itself doesn't have a special meaning; what matters is its gradient, which is implicit in the optimizer step.

Correct?

@ArikVoronov your intuition is correct. This lecture on policy gradients looks into this problem on slide 28.
