The tf loss is defined as:
self.loss = -tf.log(self.picked_action_prob) * self.target
where self.target is the advantage estimate.
But as I understand it, in policy gradient methods that expression gives the actual gradient of the parameters, not some loss. Shouldn't the update step just be theta <- theta + alpha * grad(log(pi(s,a))) * A(s,a)?
So why are we minimizing a loss? Is it just a trick, since grad = 0 is a minimum point? Why use TF at all?
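(For concreteness, a minimal TF1-style sketch of the kind of setup I mean; the layer sizes and variable names are illustrative, not the repo's exact code:)

```python
import tensorflow as tf  # TF 1.x style, matching the snippet above

state = tf.placeholder(tf.float32, [1, 4], name="state")   # toy 4-dim state input
action = tf.placeholder(tf.int32, name="action")           # action actually taken
target = tf.placeholder(tf.float32, name="target")         # advantage estimate

logits = tf.layers.dense(state, 2)                         # toy 2-action policy head
action_probs = tf.squeeze(tf.nn.softmax(logits))
picked_action_prob = tf.gather(action_probs, action)

# The "loss" being minimized -- my question is why we minimize this
# instead of applying the gradient ascent step to theta directly.
loss = -tf.log(picked_action_prob) * target
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)
```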
The goal of a loss function is to find the minimum loss; policy gradient methods ensure that each iteration step moves in the right direction (decreasing the loss). If you only look for gradient = 0, you could end up at a local or global maximum instead.
Thanks for the answer, but that's not exactly what I asked.
I think I got it now, but I'd like a confirmation:
For a table lookup (like the first exercise), minimizing a loss is unnecessary; you could just analytically derive an explicit expression for:
grad(log(pi(s,a))) = f(s,a)
then iterate the policy weights (theta) directly, the way David Silver explains in lecture 7 (the update formula from my original question).
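(A rough NumPy sketch of what I mean by iterating theta directly, assuming a tabular softmax policy; in that case the score function is grad(log(pi(s,a))) = one_hot(a) - pi(.|s). The sizes and step size here are just examples.)

```python
import numpy as np

n_states, n_actions, alpha = 16, 4, 0.1
theta = np.zeros((n_states, n_actions))      # one preference per (state, action) pair

def pi(s):
    # Softmax policy over the action preferences for state s
    prefs = theta[s]
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def update(s, a, advantage):
    # Analytical score function for a tabular softmax policy:
    # grad_theta log pi(a|s) = one_hot(a) - pi(.|s)   (only row s changes)
    grad_log_pi = -pi(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha * advantage * grad_log_pi  # plain gradient ascent step
```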
However, for a deeper and more complicated NN (with a general state input, not one-hot) this gets messy: you would essentially have to derive and compute the full backprop yourself. So it's simpler to formulate a "pseudo-loss" function and let TensorFlow calculate the gradient automatically, which is exactly what the code above does.
In addition, TensorFlow applies the iterative step to all the weights, using whatever update algorithm you want (not just theta += alpha * delta).
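(For example, in a TF1-style sketch like the one in my question, swapping the update rule only means swapping the optimizer; the tiny stand-in policy below is just to keep the snippet self-contained.)

```python
import tensorflow as tf  # TF 1.x style

# Tiny stand-in policy (1 state, 2 actions) so the snippet runs on its own.
logits = tf.get_variable("logits", shape=[2], initializer=tf.zeros_initializer())
action = tf.placeholder(tf.int32, name="action")
target = tf.placeholder(tf.float32, name="target")
picked_action_prob = tf.gather(tf.nn.softmax(logits), action)

loss = -tf.log(picked_action_prob) * target

# TF computes d(loss)/d(theta) for every weight and applies the step;
# the update rule is whichever optimizer you pick, not just theta += alpha * delta.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
# train_op = tf.train.AdamOptimizer(0.01).minimize(loss)      # adaptive per-weight steps
# train_op = tf.train.RMSPropOptimizer(0.01).minimize(loss)
```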
So the loss value itself doesn't have a special meaning; what matters is its gradient, which is applied implicitly by the optimizer.
Correct?
@ArikVoronov your intuition is correct. This lecture on policy gradients discusses this issue on slide 28.