ML-Agents: Contribution: Self-play

Created on 6 Aug 2018 · 13 Comments · Source: Unity-Technologies/ml-agents

Contribution suggestion: Self-play

We are a group of researchers at the University of York, UK. We are using Unity ML-Agents as a testbed for our RL experiments. We are currently implementing a self-play system for Unity ML-Agents based on the following paper: Emergent Complexity via Multi-Agent Competition. It is our intention to offer it as a contribution to the existing Unity ML-Agents codebase.

Motivation

In the current Unity ML-Agents 0.4 release, there are two ways of implementing self-play:

One

Following the example of the Tennis sample environment:

  • Create two or more identical agents.
  • Link each agent to the same brain.
  • Start training.

Issues: We have observed the following disadvantage in local testing:
The agent only ever plays against the current version of itself, overfitting to its own behaviour and failing to generalize to other strategies.

Two

  • Create two or more identical agents.
  • Link each agent to its own brain (both brains trained using the same algorithm, differing only in their initial weights).
  • Start training.

Issues:
Early in training, one agent becomes dominant and overpowers the other for the rest of training. This means the stronger agent relies too heavily on exploiting the weaknesses of the weaker agent.

Our self-play contribution

Theoretical approach

We would like to implement a self-play system inspired by the one introduced in Emergent Complexity via Multi-Agent Competition. The overall idea is to keep track of the latest iteration of a policy alongside checkpoints from its training history. At a fixed interval, the latest policy is matched against either itself or a previous checkpoint, sampled uniformly at random from the eligible versions. This approach works not only for 1v1 scenarios but also for scenarios with many agents in play: one agent trains and learns over time, while the rest use policies sampled from historical checkpoints.

This is done by introducing two hyperparameters:

  • Delta (δ): takes values in [0, 1] and indicates how much of the policy history is considered when sampling a new opponent. With δ = 1, only the latest policy is used; with δ = 0, the entire history is considered.

  • Opponent policy change interval: a positive number specifying how many episodes are played before a new opponent is sampled.
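
For concreteness, the sampling rule can be sketched in a few lines of Python (a minimal sketch under our reading of the paper; the function name and list-of-checkpoints representation are our assumptions, not ml-agents code):

    import random

    def sample_opponent(history, delta):
        # history: checkpoint paths ordered oldest-first (history[-1] is latest)
        # delta = 1 -> only the latest policy; delta = 0 -> the whole history
        oldest_eligible = int(delta * (len(history) - 1))
        return random.choice(history[oldest_eligible:])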

A graphical representation of the training process can be found below:

[Figure: graphical representation of the self-play training process]

Benefits:

  • The agent trains against a varied set of opponents, avoiding overfitting to its own strategy and becoming more resilient to different strategies and levels of play. We hypothesize that this produces an AI that can more easily adapt to different players.

  • It becomes easy to monitor whether the latest version of the agent can defeat previous (and random) versions of itself. If training only consists of matches between the latest version of the algorithm and itself, we are not monitoring performance against other possible opponents, meaning that an increase in overall model performance may be due to overfitting against the agent's latest strategy.

Things to Consider:

  • This self-play mechanism requires a history of policies created during training, which can demand large amounts of storage. A checkpoint of a TensorFlow graph using the default Unity ML-Agents settings for two agents takes up roughly 2 MB, so a history of 1,000 policies would need about 2 GB, which may be inconvenient for some users.

  • With the above storage consideration in mind, if we store a fixed number of historical policies, say n, then δ only matters until we have stored n policies, after which samples will only come from some proportion of the last n policies. This leads to a sliding window of historical policies rather than a window that scales as training goes on. To avoid this we would have to prune policies more intelligently, keeping a sparser policy history as time goes on; however, we don't currently know how to achieve this.
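
One hypothetical pruning scheme (our suggestion, not part of the proposal above) would thin older checkpoints so their spacing grows geometrically with age, keeping the window scaling rather than merely sliding:

    def prune_history(history, capacity):
        # history: checkpoints ordered oldest-first; keeps at most `capacity`
        # entries, dense near the present and sparse further back in time
        if len(history) <= capacity:
            return history
        kept, stride, i = [], 1, len(history) - 1
        while i >= 0 and len(kept) < capacity:
            kept.append(history[i])
            i -= stride
            stride *= 2  # spacing doubles with age
        return list(reversed(kept))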

Code contribution

We have a few proposed solutions for the self-play architecture, two of which involve creating a new brain type. The reason the brain needs to be modified is that we need a way of distinguishing brains which sample from the policy history and represent opponents to the learning brain. The advantage of introducing new brain types, as we see it, is that it might make things nicer in the Unity UI: you just add a brain and set it to the self-play type.

  • Introduce new brain type CoreBrainInternalSelfPlay: This would introduce a new brain type loosely based on the existing CoreBrainInternal that samples from many pre-saved models. In theory this should keep more of the work in C#, since the randomised sampling can all be done in C# and we should only need minimal changes to the Python code. This approach would feature an external agent that learns on the Python side, and one or more CoreBrainInternalSelfPlay agents that use TensorFlowSharp to dynamically load checkpoint models created by the external agent as training progresses. Both hyperparameters would be kept as fields of the new brain type's class.

  • Introduce new brain type CoreBrainExternalSelfPlay: This would introduce a new brain type based instead on the existing CoreBrainExternal. It would therefore require more significant changes to the Python code, to allow the external brain to load a randomly sampled historical policy. We think the jumping-off point for these changes would be the start_learning method in trainer_controller.py, which already features a check for episode termination that propagates to the various trainers and which we could use to trigger policy resampling (see the sketch after this list). Both hyperparameters would be stored in the Python-side hyperparameter file trainer_config.yaml.

  • Modify current CoreBrainExternal or CoreBrainInternal: This wouldn't introduce a new brain type, so it may be nicer in terms of retaining the simplicity of having four clearly defined brain types, but it would necessitate additional parameters on the existing brains, which could be fine with sensible defaults but might make the CoreBrains slightly more confusing to use. It would require adding code to either CoreBrain class similar to what the previous two solutions add to their subclasses.
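
As a rough sketch of how the episode-termination check mentioned in the second option could trigger resampling (names such as load_random_historical_policy are hypothetical, not actual ml-agents code):

    class SelfPlayController:
        # Counts completed episodes and resamples ghost opponents once the
        # opponent policy change interval elapses.
        def __init__(self, ghost_trainers, opponent_change_interval):
            self.ghost_trainers = ghost_trainers
            self.interval = opponent_change_interval
            self.episodes = 0

        def on_episode_end(self):
            self.episodes += 1
            if self.episodes >= self.interval:
                for trainer in self.ghost_trainers:
                    # hypothetical helper: loads a checkpoint sampled per delta
                    trainer.load_random_historical_policy()
                self.episodes = 0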

Our preference would be to create one of the new, specialised Brain types, either internal or external. However, we would greatly appreciate some direction on which you prefer, and whether you think one of these solutions is more appropriate. There may also be a better solution that we have not considered.

Please let us know if you would be interested in this contribution. We would also welcome any discussions and / or questions from both the Unity team and the community!

discussion request

All 13 comments

Hi @Danielhp95,

Thanks for making this thread. The idea of implementing a self-play system is something we have been thinking about ourselves for a while. We've been internally referring to the idea as that of a "Ghost Trainer," since the agent would be learning against "ghost" versions of itself. We would be really happy for you to work on this, and also be happy to provide help and feedback so we can ensure that it becomes a feature that can be shared with the entire community.

I have a few initial ideas based on your post. The first is that I actually don't imagine the memory requirements being so great. I think in most cases it will suffice to keep only on the order of tens of previous policies, rather than hundreds or thousands (so long as those past policies are representative of the history of behaviors of the agent).

As for the best approach to implement this, I think multiple options are possible. Internally we had been working with the idea that we wouldn't change or add any Brain types on the Unity side, and would rather create a new Trainer in Python (for an external brain), whose sole job is to load and provide inference for previous versions of another model. Keeping things in Python allows us to keep our Unity API relatively simple. That being said, we have also been discussing internally the possibility of sending policies over to Unity from Python via the API, and in that case something like an "OnlineInternalBrain" could make a lot of sense too.

Hi all!

My team and I have come up with a design dilemma that we thought could be of interest to the community.

As explained in the introductory post of the contribution, the self-play system we want to introduce features a hyperparameter that we have named Opponent Policy Change Interval (OPCI), which states how much time needs to elapse before a new opponent is sampled from the already-computed historical policies.

The dilemma is the following: should this hyperparameter measure the number of episodes or the number of steps before a new opponent is sampled?

Here are a couple of arguments for each option:

  1. In favour of OPCI denoting number of episodes:

    • It may be the case that, for environments that require long-term strategic planning, changing the opponent policy in the middle of an episode can "destroy" the work done by that opponent if the new policy cannot exploit it. That is, an opponent can set up a play, be replaced by a new opponent policy, and the new policy then fails to exploit that set-up.
  2. In favour of OPCI denoting number of steps:

    • Works intuitively with infinitely running (non-episodic) tasks; per-episode OPCI would need a hack for these.
    • In scenarios where bad policies lead to longer episodes (for instance, racing environments), per-episode OPCI would feature more model updates against bad policies: better policies lead to shorter episodes, which means fewer environment steps processed, which in turn means fewer model updates.

This is by no means an exhaustive list of arguments for either side. We welcome a discussion on the topic. To the best of my knowledge there is no real literature on this, but do share papers on the topic if you know of any!

Dani and I have just had a discussion about this OPCI stuff and we've come up with an idea. Apologies if the formatting isn't as nice: I'm on a train writing this on my phone. Anyway, here goes:
Should OPCI actually be two parameters, at least in implementation? By having a second parameter OPCIU (Opponent Policy Change Interval Unit), an enum currently consisting of STEP and EPISODE, we gain some nice flexibility here, and we can extend it in some funky ways in the future.
The most obvious other unit, which might not have any actual application but could be interesting, would be real time, e.g. SECOND, but I'm sure there could be other cool extensions that might become useful.

In any case it doesn't seem necessary to hem ourselves in to either episode or step at this stage.
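
A minimal sketch of the two-parameter idea (only the names OPCI and OPCIU come from this comment; the rest is our illustration):

    from enum import Enum

    class OPCIUnit(Enum):
        STEP = "step"
        EPISODE = "episode"
        # SECOND = "second"  # possible future extension (wall-clock time)

    def should_resample(elapsed, opci):
        # `elapsed` counts whichever unit OPCIU selects (steps or episodes)
        # since the last resample; the caller resets it afterwards.
        return elapsed >= opci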

Hope this makes sense

Both STEP and EPISODE seem like good ideas; SECOND seems less used in other applications. One thing to keep in mind is that the implementation should be extensible to more sophisticated methods; for example, we may later want to include an advanced opponent-sampling method for self-play.

Hello @Danielhp95,

I will need to use this feature for my own project, so I was wondering what the progress is on this issue.

I would be happy to contribute or help testing if needed.

Hello @LeSphax

We currently have a raw working implementation of both self-play mechanisms that we proposed at the beginning of the issue. They can be found in the self-play branch of our ml-agents fork: https://github.com/Danielhp95/ml-agents/tree/develop-self-play

We have not really documented our changes (bad practice on our part), but we could bring you up to speed on the changes we've made and how to set up self-play training on both the Unity and Python sides.

This fork is based on a slightly dated version of the ml-agents framework, as we are running experiments and analysis on some of our own environments and didn't want to spend a lot of time keeping up to date with the latest changes in the official ml-agents repository.

I am glad that you want to contribute. Could you be more specific about what it is you want to do? That way we can give you a more nuanced explanation. I would prefer to keep the communication on GitHub to encourage collaboration with other people (just like yourself!), but I understand if you would prefer a more private means of communication.

Glad to see that you have made good progress. I looked at the changes in the code and it looked quite straightforward. If I understand correctly, you ended up going with @awjuliani's solution: you used ghost trainers, with ghost-model changes triggered by academy resets, mostly changing Python code and keeping the C# pretty much the same. Please correct me if I am wrong :P

So a bit more about what I am trying to accomplish:

At the moment, I have a game that looks like this: https://www.youtube.com/watch?v=IITLQJbAG8E.
The agent gets a reward when it scores a goal; teaching that wasn't too hard, since I can easily track how often the agent scores.

But now that I have started to add a second player, I can't track progress anymore except by watching videos of the game, because even if my agent were better than a human, it would still win 50% of the time against itself.

So the main problem I am trying to solve is getting information to make sure that my agent is getting better than older versions of itself (and maybe a scripted baseline).
It seems that an Elo ranking system is usually used to solve this issue.
The additional regularization that comes with playing against older versions would surely benefit me as well, but before starting to train with self-play, I need a way to see whether it is working.

So, talking about contributing, I would want to introduce this Elo ranking. That doesn't seem trivial, because half of the information is in Python (i.e., which ghost brain is playing?) and the other half is in C# (i.e., who won the match?). I am also not sure whether that feature should be part of self-play or not.

But I also want to make sure I will be able to access future versions of ML-Agents if I need to. I am concerned that adding more code to your fork will just make it harder to merge later.

Do you have plans to upgrade to the current "develop" version at some point? That's something I could start working on as well 👍

Hello again!

I am currently backpacking through Asia and have not been able to work on this; sorry for not mentioning that this is not currently in active development.

You are right in your little summary: we don't change the C# code, we've only added Python code!

We do plan on upgrading to the "develop" version in the future. Currently we are running experiments for a conference submission (the other contributors and I are all PhD students). Because of this, we wanted to implement a self-play mechanism for our own purposes first and, once we were done with that, make it more general / easier to use for others. It also allows us to do two iterations of the self-play framework.

You are 100% correct that measuring success / performance is tricky in a multi-agent environment. Introducing an Elo ranking would be a good next move. In case you don't know about it, TrueSkill could be a slightly nicer option: it is basically an Elo ranking with variance built in, and it scales to team-based games, whereas Elo was only conceived for 1v1 games (it came from chess). Strictly speaking, this feature isn't part of the vanilla self-play concept, but it becomes really useful for more complex self-play applications (e.g., only keeping ghost brains that have a high Elo).

We will be running our experiments on this next January. Sorry if that's a bit late! If you want, we can show you how to run our current self-play setup, to see whether it actually works for your game. We can give you a set of instructions if that helps.

I have a reliable source of wi-fi this week, so I will be more active here if you need me / us!

Thanks for the update @Danielhp95

Once the term starts again in January, let's chat more about this work. We'd be very happy to have both the self-play system and an Elo-style ranking built into ML-Agents for everyone to take advantage of!

Hello @Danielhp95, @awjuliani.

Thanks for the TrueSkill suggestion, I will look into it. It seems more complicated to implement than the Elo rating, but maybe we could use something like https://pypi.org/project/trueskill/.
Unfortunately, my game is on version 0.5 of ML-Agents, so I think it could be complicated to use your branch with it.

So instead, I started to implement self-play on the develop branch from scratch. I went for the same idea of creating a ghost version of the agent, but instead of creating self_play_trainer_controller.py, I created a ghost/trainer.py class and only changed the _initialize_trainers function in trainer_controller.py.
The idea is that instead of setting a self-play flag on a brain, you create two brains, one of which has a reference to the other, something like this:

TennisLearning:
    normalize: true
    max_steps: 2e5

TennisGhost:
    ghost_of: TennisLearning

Then the Unity side can decide which agent uses the ghost brain and which agent uses the learning brain.
At the moment the ghost trainer loads 3 policies and assigns each agent randomly to one of them. On each academy reset it changes the model of each policy to one of the past versions of its master_trainer.

This part works fine at the moment, even though it needs more testing and is not yet configurable.
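
In rough Python, the behaviour described above might look like this (a sketch based on this comment; latest_checkpoint and past_checkpoints are hypothetical helpers of the master trainer, not actual ml-agents code):

    import random

    class GhostTrainer:
        def __init__(self, master_trainer, num_policies=3):
            self.master_trainer = master_trainer
            # start every slot from the master's latest checkpoint
            self.policies = [master_trainer.latest_checkpoint()] * num_policies
            self.agent_to_policy = {}

        def assign_agent(self, agent_id):
            # bind each agent at random to one of the loaded policies
            self.agent_to_policy[agent_id] = random.randrange(len(self.policies))

        def on_academy_reset(self):
            # swap each policy for a random past version of the master trainer
            for i in range(len(self.policies)):
                self.policies[i] = random.choice(self.master_trainer.past_checkpoints())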

I also started on the Elo rating part. With each observation, I send a text observation of the form "opponent_agent_id|match_result".
So for example: "1234|win", "1234|loss" or "4321|playing".
It works fine for my current environment with two agents, but I will need to format it differently for larger teams.

On the Python side, the Elo rating is a TensorFlow variable, just like the global step, which allows it to be saved with the model and loaded back afterwards.
The text observations are then used to find out the results of matches between agents and adjust the Elo ratings of their models accordingly.
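
For reference, the observation parsing and the standard Elo update could look roughly like this (a sketch of the scheme described above; the k-factor and function names are our assumptions):

    def parse_result(text_obs):
        # observations arrive as "opponent_agent_id|match_result",
        # e.g. "1234|win", "1234|loss" or "4321|playing";
        # "playing" entries are skipped until the match resolves
        opponent_id, result = text_obs.split("|")
        return opponent_id, result

    def elo_update(rating_a, rating_b, score_a, k=32.0):
        # standard Elo update; score_a is 1.0 for a win, 0.0 for a loss
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
        delta = k * (score_a - expected_a)
        return rating_a + delta, rating_b - delta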

I think I still have some bugs in this part, because when testing on the Tennis environment, the Elo rating doesn't increase consistently even though the agent's cumulative reward does.

So that's what I have been doing these past two weeks; please tell me if you think there is a better way to do this :)

Otherwise, I was planning to continue until I have something that works well for my use case and then improve the interface/configurability before creating a pull request.

Due to inactivity, I am closing this issue for now. Please feel free to re-open if you deem it necessary. We are still interested to hear about self-play progress people may make!

Hello @awjuliani

I spent some more time working on it, fixed the issues, and created pull request #1975 :)

@LeSphax

Thanks! Our team will take a look.
