Stable-baselines: [question] Custom PPO implementation does not train with Pong

Created on 21 Nov 2019 · 5 Comments · Source: hill-a/stable-baselines

Hi

I am not sure whether this counts as a question I can ask here. I have been working with a PPO agent implementation that seemed to train on my (custom) environment. However, to test how good or bad this implementation is, I am now trying to train it on Atari games.

This, however, has not been a very rewarding exercise: I cannot get the agent to train. I had to include the

env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)

otherwise it didn't produce any variety in actions. After adding these, however, it runs for some time trying out different actions. Within a few epochs, though, it starts producing the same action over and over again, and I cannot understand why. I looked into the stable_baselines code and saw that there were a few differences:

  1. The network for the actor (policy) and the critic (value) is the same.
  2. Your network is different (I copied it into my actor and critic networks).
  3. The way the loss is calculated is very different. There are a few things that were not in the original paper, like clipping of the value or the way the PPO loss is computed. I was using the minimum of the two values (r*Advantage and the clipped term), but you take the maximum of two negated values. I understand that it is the same numerically but didn't understand why it was done like this.
  4. In the rl-baselines-zoo code there are even more wrappers for the environment, which come from the wrap_deepmind routine in atari_wrappers.py.

The reason I still want my agent to work is that I would then have a comparable measure for both my environment and the Atari environments. I understand that one suggestion could be to use the baselines agent, but I suspect I would then not really understand how it works for either the Atari games or my custom environment.

I am attaching a zip of the code I am trying to run.

Please don't mind my writing about this issue. If it violates the submission guidelines, I apologize.
I think that your repo is very good and useful for any RL practitioner who wishes to understand why an agent trains, and who wishes to replicate results to gain confidence that it is not something written once for a publication only.

I will continue hunting this down, as there are (as I also note in my request) still differences between the code I attach here and yours. It is just that I do not completely follow why all those wrappers were added to the environment, or why the loss is the way it is.

with kind regards
Rohit

question

All 5 comments

Hello,
Did you try using the rl zoo (https://github.com/araffin/rl-baselines-zoo) for training it on Atari games?

python train.py --algo ppo2 --env BreakoutNoFrameskip-v4

Hi Antonin

Yes, I used them and they train without an issue.

with kind regards
Rohit


Like you mentioned yourself, this is not the place for tech support. However, given that the questions you raised are good ones, I will try to answer them:

1) Yes, it is common to share at least some of the network between the policy and the value function (the intuition is that both extract similar features from images with CNNs, for example).
2) I did not quite understand this question. With default settings the network corresponds to the one in the original DQN paper (the Nature one).
3) Yes, there are some modifications in the loss, but these are disabled by default. Things like advantage normalization were also included in the original PPO code, but IIRC this was not mentioned in the paper. As for other quirks (e.g. the PPO pi loss), it might just be how the author of the code felt it was most intuitive to write.
4) Some Atari environments "need" the preprocessing of some of these wrappers (the deepmind_wrapper) to be easier to learn, and without them the task might be too difficult. This topic has been discussed in some recent-ish papers, but I do not have any of those with me for linking right now.

Your best bet is to start with known, working hyperparameters like the ones available in the rl-zoo, and tune from there for your custom environment.

Edit: See better answer below ^^

Sorry, I overlooked this issue. So the question is more: what are the tricks that make PPO work, particularly on Atari games?

For Atari, the preprocessing and action repeat really matter; you can find a good explanation here: https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/
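To make the action-repeat idea concrete, here is a minimal sketch in plain Python. It is not the library's MaxAndSkipEnv: the names CountingEnv and SkipAndMax are hypothetical, the "frame" is a one-element list, and the toy environment only counts steps. It only illustrates the mechanism of repeating an action, summing the rewards, and taking the pixel-wise maximum of the last two frames.

```python
class CountingEnv:
    """Hypothetical toy environment: each 'frame' is a one-pixel list holding the step count."""

    def __init__(self, episode_length=10):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [self.t]

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return [self.t], 1.0, done, {}


class SkipAndMax:
    """Illustrative wrapper (assumed name): repeat each action `skip` times."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        last_frames = []
        done = False
        info = {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            # Keep only the two most recent frames.
            last_frames = (last_frames + [obs])[-2:]
            if done:
                break
        # Pixel-wise maximum over the (up to) two most recent frames.
        max_frame = [max(pixels) for pixels in zip(*last_frames)]
        return max_frame, total_reward, done, info


env = SkipAndMax(CountingEnv(episode_length=10), skip=4)
env.reset()
first = env.step(0)
second = env.step(0)
third = env.step(0)  # the episode ends after only two inner steps here
```

With skip=4 the agent chooses an action once every four emulator steps, so each decision covers a longer stretch of game time and the reward it receives is the sum over the skipped frames.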

Regarding PPO, there are different tricks, among them:

  1. Initialization matters (cf. this blog post: http://gradientscience.org/policy_gradients_pt1/).
  2. The advantage is normalized.
  3. The value function is clipped (but you can deactivate that in SB); it should not matter too much.
  4. The policy and value networks share the CNN feature extractor.
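As a rough illustration of tricks 2 and 3, here is a plain-Python sketch. The function names and the clip_range value of 0.2 are assumptions made for the example, and it works on lists rather than tensors; it is not a copy of the SB code.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a batch of advantages to zero mean and (about) unit std."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def clipped_value_loss(values, old_values, returns, clip_range=0.2):
    """Value loss where the new prediction may move at most clip_range from the old one."""
    losses = []
    for v, v_old, ret in zip(values, old_values, returns):
        # Clip the change in the value prediction, not the prediction itself.
        v_clipped = v_old + max(-clip_range, min(clip_range, v - v_old))
        loss_unclipped = (v - ret) ** 2
        loss_clipped = (v_clipped - ret) ** 2
        # Pessimistic choice: keep the larger of the two squared errors.
        losses.append(max(loss_unclipped, loss_clipped))
    return 0.5 * sum(losses) / len(losses)


normalized = normalize_advantages([1.0, 2.0, 3.0])
example_loss = clipped_value_loss([1.0], [0.0], [1.0], clip_range=0.2)
```

In the example call the new value prediction already matches the return, but because it moved by more than clip_range from the old prediction, the clipped term dominates and the loss is not zero.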

"understand that it is the same numerically but didn't understand why it was done like this"

This does not matter, it is the same. As for the why, you should ask people from OpenAI.

In general, because of all this and the possible bugs in a custom implementation, I would recommend that you use a fully tested implementation (like the one from SB) instead of a custom one, unless you want to learn how to implement RL (I'm currently writing a PR that may help you too: #536, https://github.com/hill-a/stable-baselines/pull/536).

EDIT: I forgot one point mentioned by @Miffyli: hyperparameters matter a lot too (including the number of workers, i.e. the number of envs).
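The equivalence follows from the identity max(-a, -b) = -min(a, b). A small numeric check, written as a sketch with assumed function names and an assumed clipping parameter eps = 0.2, compares the "minimum" objective from the paper with the "maximum of two negated terms" loss:

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def surrogate_min_form(ratio, advantage, eps=0.2):
    """Paper-style objective (to be maximized): min of unclipped and clipped terms."""
    return min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)


def loss_max_form(ratio, advantage, eps=0.2):
    """Loss (to be minimized): max of the two negated terms."""
    return max(-ratio * advantage, -clip(ratio, 1 - eps, 1 + eps) * advantage)


rng = random.Random(0)
samples = [(rng.uniform(0.0, 3.0), rng.uniform(-5.0, 5.0)) for _ in range(1000)]

# If the two forms agree, loss == -objective for every sample, so the gap is zero.
max_gap = max(abs(loss_max_form(r, a) + surrogate_min_form(r, a)) for r, a in samples)
```

Minimizing the second form with gradient descent is therefore the same as maximizing the first; the negated version is simply written directly as a loss.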

Thank you Antonin. That is really useful. I think we need to decide as to
what codebase to continue with. I will use the sources you have sent and
improve my understanding of the 'tailor-made' features in these agents.
Thank you for the PR as well.

with kind regards
Rohit

