Stable-baselines: [Question] How best to implement self-play/multiple agents in the same environment?

Created on 31 Jan 2019 · 4 Comments · Source: hill-a/stable-baselines

I'm trying to train a model using self play, and really love the work that has been done here so far. I was wondering whether anyone might have some advice about how I might adapt PPO2 to allow for multiple models to play against each other in the same environment.

The overall strategy would be to:

1. Store N models in a list
2. Generate an action from each of these models using a single observation
3. Pass these actions to the environment and get back a list of rewards, one per agent
4. Update the models based on these rewards
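
To make the strategy above concrete, here is a minimal sketch of that loop in plain Python. Nothing in it comes from stable-baselines: `ToyMultiAgentEnv`, `RandomAgent`, and `run_episode` are hypothetical names made up for illustration, and `update()` is only a placeholder where a real learning step (for example a PPO update) would go.

```python
import random


class ToyMultiAgentEnv:
    """Hypothetical N-player minority game standing in for the custom environment."""

    def __init__(self, n_agents, episode_length=5):
        self.n_agents = n_agents
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # one observation shared by all agents

    def step(self, actions):
        ones = sum(actions)
        zeros = self.n_agents - ones
        # The less popular action wins; on a tie nobody scores.
        minority = None if ones == zeros else (1 if ones < zeros else 0)
        rewards = [1.0 if a == minority else 0.0 for a in actions]
        self.t += 1
        return self.t, rewards, self.t >= self.episode_length, {}


class RandomAgent:
    """Hypothetical stand-in for a model: acts randomly and records its own experience."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.buffer = []

    def act(self, obs):
        return self.rng.randint(0, 1)

    def store(self, obs, action, reward):
        self.buffer.append((obs, action, reward))

    def update(self):
        # Placeholder for a learning step; here it only empties the buffer
        # and reports how many transitions it would have trained on.
        n = len(self.buffer)
        self.buffer.clear()
        return n


def run_episode(env, agents):
    """Steps 2 and 3: every agent acts on the same observation, the env returns one reward each."""
    obs = env.reset()
    totals = [0.0] * len(agents)
    done = False
    while not done:
        actions = [agent.act(obs) for agent in agents]
        next_obs, rewards, done, _ = env.step(actions)
        for i, agent in enumerate(agents):
            agent.store(obs, actions[i], rewards[i])
            totals[i] += rewards[i]
        obs = next_obs
    return totals


agents = [RandomAgent(seed=i) for i in range(3)]           # step 1
totals = run_episode(ToyMultiAgentEnv(n_agents=3), agents)  # steps 2 and 3
for agent in agents:                                        # step 4
    agent.update()
```

Each agent keeps its own buffer here, so the update step never needs to know that other agents exist; that is one possible design, not the only one.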

I have written a custom environment that can take an array of actions, update the game state, and then return a list of rewards, one per agent. My main issue is in prying apart the actual model from its interactions with the gym environment. I have been trying to decouple the model from the runner, but they seem quite tightly intertwined and I'm having a difficult time. Has anyone else played around with this idea before, or would anyone be able to point me in the right direction?
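
One possible way to sidestep the model/runner coupling, sketched under assumptions: rather than changing the runner, hide the other players inside the environment so that the learner sees an ordinary single-agent `reset`/`step` interface. The names `TwoPlayerEnv`, `FixedOpponent`, and `SingleAgentView` below are hypothetical, and the opponent's `predict` returning an `(action, state)` pair is only meant to mirror the shape of `predict` on stable-baselines models. A real version would presumably subclass `gym.Env` and declare `observation_space` and `action_space`.

```python
class TwoPlayerEnv:
    """Hypothetical matching-pennies game standing in for the custom environment."""

    def __init__(self, episode_length=4):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, actions):
        a0, a1 = actions
        # Player 0 wins when the actions match, player 1 when they differ.
        r0 = 1.0 if a0 == a1 else -1.0
        self.t += 1
        return self.t, [r0, -r0], self.t >= self.episode_length, {}


class FixedOpponent:
    """Hypothetical frozen policy; always plays the same action."""

    def __init__(self, action):
        self.action = action

    def predict(self, obs):
        return self.action, None  # (action, state)


class SingleAgentView:
    """Exposes a multi-agent env as single-agent by querying the opponent internally."""

    def __init__(self, env, opponent, learner_index=0):
        self.env = env
        self.opponent = opponent
        self.learner_index = learner_index
        self.last_obs = None

    def set_opponent(self, opponent):
        # For self-play, this could be called periodically with a snapshot of the learner.
        self.opponent = opponent

    def reset(self):
        self.last_obs = self.env.reset()
        return self.last_obs

    def step(self, action):
        opp_action, _ = self.opponent.predict(self.last_obs)
        actions = [None, None]
        actions[self.learner_index] = action
        actions[1 - self.learner_index] = opp_action
        obs, rewards, done, info = self.env.step(actions)
        self.last_obs = obs
        # Only the learner's own reward is passed on.
        return obs, rewards[self.learner_index], done, info


view = SingleAgentView(TwoPlayerEnv(), FixedOpponent(action=1))
obs = view.reset()
obs, reward, done, info = view.step(1)
```

The trade-off in this sketch is that only one agent learns at a time; the others are treated as part of the environment until they are swapped out via `set_opponent`.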

question

brokenloop

👍3

Most helpful comment

Yeah it's still in the commit history.

https://github.com/HumanCompatibleAI/adversarial-policies/tree/99700aab22f99f8353dc74b0ddaf8e5861ff34a5/src/aprl/agents

AdamGleave on 19 May 2020

👍2

All 4 comments

Hello,

I think @AdamGleave tackled that problem in the Adversarial policies repo, you should take a look ;)

araffin on 15 Jun 2019

I never finished the self-play implementation but it might still be worth looking at: https://github.com/HumanCompatibleAI/adversarial-policies/blob/master/src/aprl/agents/ppo_self_play.py

AdamGleave on 16 Jun 2019

👍1

> I never finished the self-play implementation but it might still be worth looking at: https://github.com/HumanCompatibleAI/adversarial-policies/blob/master/src/aprl/agents/ppo_self_play.py

@AdamGleave I can't access the page. Is there still an available/public version of it?