Ray: How to Implement Self Play with PPO? [rllib]

Created on 2 Jan 2020  ·  29 Comments  ·  Source: ray-project/ray

How to Implement Self Play with PPO?

Python: 3.6.9
TensorFlow: tensorflow-gpu 2.0.0
Ray: ray 0.8.0.dev6
OS: Ubuntu 18.04.2

I'm trying to implement a self-play training strategy with PPO similar to the efforts of OpenAI's Five (Dota) and DeepMind's FTW (Capture-the-flag). My understanding is that these methods train a policy in a competitive manner: the agent plays a game against itself (same policy) as well as a mixture of prior policies. In RLlib terms, each iteration would have the trainer sample the adversary's policy from a distribution of policies. For example:

agent_0: policy_0 = 100%

agent_1: policy_0 = 85%
policy_1 = 5%
policy_2 = 5%
policy_3 = 5%

Policy_0 is the main policy being trained and the other policies are older versions of it, perhaps refreshed with the weights of the newer policy network every 5 iterations. This training strategy could also be used for tasks/games that are not inherently competitive but could still benefit from the additional learning signal that competition provides. Doing so would require turning the game into a multi-agent environment and augmenting the reward scheme with an extra reward for the winner and a punishment for the loser. I've followed this line of reasoning with my custom environment and implemented the training script below, which uses PPO to perform the policy optimization. However, I get an error that appears to be related to how TensorFlow is defining the graph of each policy network. I'd appreciate any help in understanding how my script could be fixed to implement this type of training correctly.

Error

Traceback (most recent call last):                                                                                    
  File "run_PPO_multi_selfplay.py", line 233, in <module>                                   
    "policy_02": ppo_trainer.get_weights(["policy_01"])["policy_01"],                                                 
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 705, in set_weights                                                                                                              
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 533, in set_weights                                                                                                   
    self.policy_map[pid].set_weights(w)                                                                              
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/policy/tf_policy.py", line 269, in set_weights                                                                                                            
    return self._variables.set_weights(weights)                                                                       
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/experimental/tf_utils.py", line 189, in set_weights                                                                                                             
    assert assign_list, ("No variables in the input matched those in the "
AssertionError: No variables in the input matched those in the network. Possible cause: Two networks were defined in the same TensorFlow graph. To fix this, place each network definition in its own tf.Graph. 

Training Script

FYI: I disabled the policy weight updates at the bottom in order to troubleshoot, but I'd like to get that working as well.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import contextlib
import gym
import os
import datetime
import sys

import argparse
import numpy as np

import ray
from ray import tune
from ray.tune import run_experiments, register_env
from ray.rllib.models import ModelCatalog

from ray.rllib.agents.ppo.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_policy import PPOTFPolicy
from ray.tune.logger import pretty_print

#####################################################
#Custom Model
from gym.spaces import Box, Discrete, Dict

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet_v2 import FullyConnectedNetwork
from ray.rllib.models.tf.misc import normc_initializer
from ray.rllib.utils.annotations import override, DeveloperAPI
from ray.rllib.utils import try_import_tf

tf = try_import_tf()

class MaskedActions(TFModelV2):
    """Custom RLlib model that emits -inf logits for invalid actions.

    This is used to handle the variable-length action space.
    """
    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name, **kw):
        super(MaskedActions, self).__init__(obs_space, action_space, num_outputs, model_config, name, **kw)

        self.fc_model = FullyConnectedNetwork(
            Box(-1, 1, shape=(9, )), 
            action_space, 
            num_outputs,
            model_config, name + "_fc")
        self.register_variables(self.fc_model.variables())

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        # Pull the action mask out of the Dict observation
        # (the "action_mask" key is assumed, following the standard RLlib parametric-actions pattern)
        action_mask = input_dict["obs"]["action_mask"]

        # Forward pass through the fully connected network
        action_logits, _ = self.fc_model({
            "obs": input_dict["obs"]["obs"]
        })

        # Mask out invalid actions (use tf.float32.min for stability)
        inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)
        return action_logits + inf_mask, state

    def value_function(self):
        return self.fc_model.value_function()
#####################################################################

def policy_mapping_fn(agent_id):
    if agent_id.startswith("agent_01"):
        return "policy_01" # Choose 01 policy for agent_01
    else:
        return np.random.choice(["policy_01", "policy_02", "policy_03", "policy_04"],1,
                                p=[.8, .2/3, .2/3, .2/3])[0]

parser = argparse.ArgumentParser()
parser.add_argument("--num-iters", type=int, default=300)
parser.add_argument("--num-workers", type=int, default=15)
parser.add_argument("--num-envs-per-worker", type=int, default=20)
parser.add_argument("--num-gpus", type=int, default=4)
args = parser.parse_args()

ray.init()

# env_config (the custom environment's configuration dict) is defined elsewhere in the full script
register_env("custom_env", lambda custom_args: gym.make('gym_custom_env:env-v0', configDict=env_config))
ModelCatalog.register_custom_model("mask_model", MaskedActions)

# Make a standalone gym env so we can read off the observation/action spaces
single_env = gym.make('gym_custom_env:env-v0', configDict=env_config)
obs_space = single_env.observation_space
act_space = single_env.action_space

ppo_trainer = PPOTrainer(
    env="custom_env",
    config={
        "num_workers": args.num_workers,
        "num_envs_per_worker": args.num_envs_per_worker,
        "num_gpus": args.num_gpus,
        "ignore_worker_failures": True,
        "train_batch_size": 100000,
        "sgd_minibatch_size": 10000,
        "lr": 3e-4,
        "lambda": .95,
        "gamma": .998,
        "entropy_coeff": 0.01,
        "kl_coeff": 1.0,
        "clip_param": 0.2,
        "num_sgd_iter": 10,
        "observation_filter": "NoFilter",  # breaks the action mask
        #"vf_share_layers": True,
        "vf_loss_coeff": 1e-4,    #VF loss is error^2, so it can be really out of scale compared to the policy loss. 
                                      #Ref: https://github.com/ray-project/ray/issues/5278
        "vf_clip_param": 100.0,
        "model": {
            "custom_model": "mask_model",
            "fcnet_hiddens": [512],
        },
        "multiagent": {
            "policies": {
                "policy_01": (None, obs_space, act_space, {}),
                "policy_02": (None, obs_space, act_space, {}),
                "policy_03": (None, obs_space, act_space, {}),
                "policy_04": (None, obs_space, act_space, {})
            },
            "policy_mapping_fn": tune.function(policy_mapping_fn),
            #"policies_to_train": ["policy_01"]
        },
        "callbacks": {
        "on_episode_start": tune.function(on_episode_start),
        "on_episode_step": tune.function(on_episode_step),
        "on_episode_end": tune.function(on_episode_end)
        },
    })

for i in range(args.num_iters):
    print(pretty_print(ppo_trainer.train()))
    '''
    if i % 5 == 0:
        ppo_trainer.set_weights({"policy_04": ppo_trainer.get_weights(["policy_03"])["policy_03"],
                                 "policy_03": ppo_trainer.get_weights(["policy_02"])["policy_02"],
                                 "policy_02": ppo_trainer.get_weights(["policy_01"])["policy_01"],
                                })
    '''
Labels: question, rllib


All 29 comments

Nothing jumps out at me from the code above. Can you make the script runnable to reproduce?

@josjo80, I ran into this same problem. Have you made any more progress since your initial post?

I believe the issue is that TensorFlow assigns different variable names to each network, even if the networks have identical shapes (it appends an incrementing suffix to the variable names, which shows up in the weights dictionary), since they are in the same graph. See here for more details: https://ray.readthedocs.io/en/latest/using-ray-with-tensorflow.html.

Since we want to enable setting weights in a separate policy, I expect we'll need to figure out a workaround for this issue (I haven't found one yet).
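
For illustration, here is a quick way to see the mismatch (the key names below are made up, but the pattern reflects what get_weights() returns for two policies defined in the same graph):

# Illustrative only: the exact variable names depend on your model, but the keys
# returned for two same-shaped policies in one graph will not match.
w1 = ppo_trainer.get_policy("policy_01").get_weights()
w2 = ppo_trainer.get_policy("policy_02").get_weights()

print(list(w1.keys())[0])  # e.g. "policy_01/fc_1/kernel"
print(list(w2.keys())[0])  # e.g. "policy_02/fc_1/kernel" - same shape, different key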

@ericl, Do you have any suggestions on where to adjust the creation/handling of graphs for each policy in a multi-agent environment, or have any other suggestions for a good workaround? Also, I am not opposed to working in PyTorch if a simpler solution would present itself in that case. Thanks in advance!

Perhaps using the keras save/load API would work? There is a PR here you can try (not yet merged): https://github.com/ray-project/ray/pull/7482

Hi @rhefron and @ericl. I haven't made much progress on this problem. I had to move on to another project, but I do want to try the PR fix that Eric mentions above. I think your analysis is correct, so hopefully the new API will work. I'll see if I can get my old custom environment back up and running and pull in the fix. I'm really interested in getting a self-play demo running in which the agents aren't just playing against their current selves but also prior selves. Thanks for both of your thoughts on this!

@ericl and @josjo80 Thanks for the thoughts. Time permitting, I'll try this in the next week or so and let you know what I find out. Another idea I might try is to create a Keras model and then just use the following since the np arrays should be the same size:

model.get_weights()  # returns a list of all weight tensors in the model, as NumPy arrays
model.set_weights(weights)  # sets the values of the model's weights from a list of NumPy arrays
https://keras.io/models/about-keras-models/
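
For what it's worth, a minimal sketch of that idea (the architecture here is hypothetical): because Keras returns plain NumPy arrays without variable names, two identically shaped models can exchange weights by position.

import tensorflow as tf

def build_policy_net():
    # Hypothetical stand-in for the policy network architecture
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="tanh", input_shape=(9,)),
        tf.keras.layers.Dense(4),  # e.g. 4 discrete action logits
    ])

current_policy = build_policy_net()
prior_policy = build_policy_net()

# Snapshot the current policy into the "prior self" by position, not by name
prior_policy.set_weights(current_policy.get_weights())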

Hi, I got the same AssertionError while trying to swap/set weights between
policies after certain training iterations in a multi-agent environment.

It turns out that the names of the keys in the weight dictionary of each policy
are different. The shapes of their values are the same, though.

Is the code snippet below doing something similar to what you're looking for?

Copy weights from "policy_1" to "policy_0" after each training iteration
while keeping weights of "policy_1" unchanged:

for i in range(3):  # train iter
    result = trainer.train()

    P0key_P1val = {}  # temp storage with "policy_0" keys & "policy_1" values
    for (k, v), (k2, v2) in zip(trainer.get_policy("policy_0").get_weights().items(),
                                trainer.get_policy("policy_1").get_weights().items()):
        P0key_P1val[k] = v2

    # set weights
    trainer.set_weights({"policy_0": P0key_P1val,  # weights/values from "policy_1" with "policy_0" keys
                         "policy_1": trainer.get_policy("policy_1").get_weights()  # no change
                         })

    # To check
    for (k, v), (k2, v2) in zip(trainer.get_policy("policy_0").get_weights().items(),
                                trainer.get_policy("policy_1").get_weights().items()):
        assert (v == v2).all()

Hi @ChuaCheowHuan ! Thanks for the input. Yes, that is similar to what we are trying to achieve. I've re-written my script with your recommendations and hope to be able to test out the implementation today or Monday. I'll let you know how it turns out after debugging.

@ChuaCheowHuan Thank you! That is exactly what I was looking for. I initially started looking into a similar approach, but got worried about dictionaries not being ordered. Today, I looked a bit deeper into the implementation of get_weights() and the underlying code in tf_utils, and it turns out the weights are stored as ordered dictionaries, so I'm pretty sure we're good to go with your approach. I implemented it into my code and was able to get it to work well.

@ericl The only other thing I think is worth mentioning (relevant to the self-play discussion) is the way the policies are distributed to workers. If you want to get self-play working against a distribution of previous selves (rather than just the most current one), we need to change a line of code in multi_gpu_optimizer.py (and probably in other optimizers), or create a new optimizer for self-play. I'd like to hear your thoughts on which approach would be most maintainable/preferable.

The following one-line change makes the optimizer distribute updates only to policies that are trainable. This allows you to modify the non-trainable policies as desired to push different weights to different rollout workers (as opposed to automatically replicating the local worker's weights in the rollout workers during each training call). This should enable more robust training of agents. Additionally, this change should not impact other training since it still pushes weights from the local worker to the remote workers for trainable policies.

I suggest changing the following line of code (line 130) in multi_gpu_optimizer.py: (replace the commented out line with the new line above it)

    @override(PolicyOptimizer)
    def step(self):
        with self.update_weights_timer:
            if self.workers.remote_workers():
                weights = ray.put(self.workers.local_worker().get_weights(self.workers.local_worker().policies_to_train))
                # weights = ray.put(self.workers.local_worker().get_weights())
                for e in self.workers.remote_workers():
                    e.set_weights.remote(weights)

Hmm, as I understand it you want each worker to have different weights for the same opponent policy. Instead of having the same policy with different weights, it may work to have a set of policies that are sampled from with the policy mapping function. That way, the weight sync code would not have to change. The trade-off is that there will be more policies to sync to workers.

Would that work, or am I missing something?


Hey guys, I made changes to my script but haven't tested it out yet. I'm dependent on a colleague to make the appropriate changes to our custom environment. Will let you know results as soon as I can test it out.

Regarding the policies of the remote workers... I've been working under the assumption that the opponent will sample from a distribution of separate policies defined by the policy mapping function (see my code above) and that the only policy that actually gets updated is "policy_01", by setting "policies_to_train": ["policy_01"] in the config dict. In my code above I commented that line out, but I think it's needed in order to ensure that the older policies' weights don't get overwritten by the optimizer.

@ericl and @josjo80 I think I'm tracking what you're saying, and it seems like the preferable way to handle this. @josjo80, I believe you are correct regarding the policies_to_train requirement.

To be clear, the approach entails:

  1. Define a trainable policy and several other non-trainable policies up front. The non-trainable policies will be the "prior selves" and we will update them as we train. Also define the sampling distribution for the non-trainable policies in the policy mapping function like @josjo80 did above.
  2. Train until a certain metric is met (trainable policy wins greater than 60% of the time).
  3. Update a list of "prior selves" weights that can be sampled from to update each of the non-trainable policies.
  4. Update the weights of the non-trainable policies by sampling from the list of "prior selves" weights.
  5. Back to step 2. Continue process until agent is satisfactorily trained.

Any additions or things I missed? Thanks! (A rough sketch of steps 2-4 follows below.)
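
For concreteness, a minimal sketch of steps 2-4, assuming the 4-policy config from the original script ("policy_01" trainable, "policy_02" through "policy_04" frozen via policies_to_train) and a hypothetical win-rate entry reported through custom_metrics:

import random

menagerie = [ppo_trainer.get_policy("policy_01").get_weights()]  # list of "prior selves"

for i in range(args.num_iters):
    result = ppo_trainer.train()

    # Step 2: gate on a win-rate metric reported by the environment/callbacks (name assumed)
    if result["custom_metrics"].get("policy_01_win_rate_mean", 0.0) > 0.6:
        # Step 3: add the current trainable policy to the menagerie
        menagerie.append(ppo_trainer.get_policy("policy_01").get_weights())

        # Step 4: refresh each frozen policy from a sampled prior self,
        # re-keying the weights to match that policy's own variable names
        for opponent_id in ["policy_02", "policy_03", "policy_04"]:
            sampled = random.choice(menagerie)
            opponent_keys = ppo_trainer.get_policy(opponent_id).get_weights().keys()
            ppo_trainer.set_weights(
                {opponent_id: dict(zip(opponent_keys, sampled.values()))})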

@rhefron Yes, your method is correct! The only thing I might change is step 2 - the update metric. In this paper the authors call this update metric the gating function or the curator of the menagerie of policies. In the paper they state the following:
The gating function G used in δ-uniform self-play is fully inclusive and deterministic. After every episode, it always inserts the training policy into the menagerie: G(π^o, π) = π^o ∪ {π}.
So it would seem that they are constantly populating the non-trainable policies with the latest versions. I don't think that's necessary to get good results, and I only add policies every 5 or 10 training iterations. But I do like your idea of only adding a policy to the menagerie once it has achieved some X% win rate. Definitely something to play around with.

The only other thing I'll note is that OpenAI commented in two separate posts about sampling from the menagerie. In their OpenAI Five (Dota 2) blog post they stated, "OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid 'strategy collapse', the agent trains 80% of its games against itself and the other 20% against its past selves."

And in their Competitive Self-Play post they stated, "Our agents were overfitting by co-learning policies that were precisely tailored to counter specific opponents, but would fail when facing new ones with different characteristics. We dealt with this by pitting each agent against several different opponents rather than just one. These possible opponents come from an ensemble of policies that were trained in parallel as well as policies from earlier in the training process. Given this diversity of opponents, agents needed to learn general strategies and not just ones targeted to a specific opponent."

I'm not totally sure what their policy sampling function looked like, so I only approximated it by sampling the current policy 80% of the time and dividing the remaining 20% equally among the others. The paper I mentioned above goes into more theory on a better sampling scheme, the δ-limit uniform policy sampling distribution.

@ericl If I could make a feature request for self-play, it would be to provide some options in the run_experiments API. One might be the gating function value: the number of iterations between adding a policy to the menagerie. The other would be an option to turn on the policy weight updates. The user would then take care of the policy sampling, as is currently the case.

Success! Finally got the code up and running. Appears to work as @ChuaCheowHuan suggested.

One question I have for @ericl: does training run more slowly when iterating through trainer.train() manually vs tune.run() or the run_experiments() API? The reason I ask is because of the following statement in the documentation: "Tune will schedule the trials to run in parallel on your Ray cluster." If so, does that mean that in order to do self-play in this way we automatically take a speed hit? Or is there a way to accomplish the same thing within the other APIs?

Tune will schedule the trials to run in parallel on your Ray cluster.

There's no speed hit for one trial; that only matters if you want to run a sweep across hyperparams. Note that tune also has a function-based API you can use for this sort of custom training code as well: https://ray.readthedocs.io/en/latest/tune-usage.html#tune-function-based-api

@ericl One more question for you on self-play. If we wanted to implement a "team spirit" reward sharing mechanism like OpenAI did for Dota2 how could we alter the returns of each agent over the course of training? For instance, if we wanted a "team spirit" variable that annealed at a certain schedule how could we do that without having to pass a variable back to the gym environment?

Ex:

agent_01_contribution = 9
agent_02_contribution = 1
team_total = 10
team_spirit = 0.5  #<-- This changes according to schedule over course of training

agent_01_reward = agent_01_contribution  + team_total * team_spirit = 9 + 10 * 0.5 = 14
agent_02_reward = agent_02_contribution  + team_total * team_spirit = 1 + 10 * 0.5 = 6

Check out the foreach_env function which can be used to update the env: https://ray.readthedocs.io/en/latest/rllib-training.html#curriculum-learning

Otherwise, if it's a fixed annealing schedule, that can be hard-coded in the envs.
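
A minimal sketch of the foreach_env approach, assuming the custom env exposes a set_team_spirit() setter (hypothetical) and that this function is registered as the on_train_result callback:

def on_train_result(info):
    trainer = info["trainer"]
    iteration = info["result"]["training_iteration"]

    # Linearly anneal team_spirit from 0.0 to 1.0 over the first 200 iterations (schedule assumed)
    team_spirit = min(1.0, iteration / 200.0)

    # Push the new coefficient into every env instance on every rollout worker
    trainer.workers.foreach_worker(
        lambda worker: worker.foreach_env(
            lambda env: env.set_team_spirit(team_spirit)))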

Now I've tried to pre-define the policies, with a total of 51 policies (50 old versions + 1 learning agent). Does this make the trainer initialize 50 models on my GPUs? I don't think they can handle 50 models.

Would it make more sense, in the case of two-player games, to define the opponent policy only once? Then, each time before we update the weights, we would keep the old weights somewhere in the code (or dump them to disk if they get really big), and in policy_mapping_fn do as @ChuaCheowHuan suggests.

So the config would be:

"multiagent": {
            "policies": {
                "learning_agent": (None, obs_space, act_space, {}),
                "opponent": (None, obs_space, act_space, {})
            },
            "policy_mapping_fn": tune.function(policy_mapping_fn),
            "policies_to_train": ["learning_agent"]
        },

And the policy_mapping_fn would be:

import random

def policy_mapping_fn(agent_id):
    if agent_id.startswith("agent_01"):
        return "learning_agent"  # the trainable policy plays as agent_01
    else:
        # Sample a prior self and copy its weights into the single "opponent" policy
        iteration_number = random.randrange(current_iteration)
        prior_self_weights = get_weights_from_iteration(iteration_number)

        # Temp storage with "opponent" keys & prior-self values
        P0key_P1val = {}
        for (k, v), (k2, v2) in zip(trainer.get_policy("opponent").get_weights().items(),
                                    prior_self_weights.items()):
            P0key_P1val[k] = v2

        # Set the opponent's weights, then map this agent to the opponent policy
        trainer.set_weights({"opponent": P0key_P1val})
        return "opponent"

Hi @51616, I use the policy_mapping_fn only for mapping agents to policies. I do not have the swap/set weights functionality inside the policy_mapping_fn; I call it in the on_train_result callback.

With regard to predefining policies, I suppose you have very large & deep models for each policy and would like to

"keep the old weights somewhere in the code"

probably with some custom data structures, which will eat into your memory anyway, so why not use the ones readily available by predefining policies?

@ChuaCheowHuan So I suspect that all the models must be in VRAM for the whole training period, right? I don't think that's possible if I have hundreds of versions of the model. What I proposed is to create only 2 instances of the model in two-player games, so that it's more memory-efficient and manageable for a single GPU. If the weights get too big, you can dump them to disk and then load them back as you sample the weights. Does that make sense to you? Is this a viable approach to this problem @ericl @josjo80 @rhefron ?

Hi, @51616 ,

"So I suspect that all the models must all be in the VRAM for the whole training period right?"

I want to agree with you because that sounds probable but I would rather believe that only those subset of policies that are currently in training are required in the VRAM. I could be wrong though.

"If the weights are getting too big, you can dump them then load the back as you sample the weights."

You could do that but wouldn't it be costly for the I/O? Sounds like a trade-off between space-time.

Yes, but I'd rather train slowly than not be able to train at all. I did play with the code and found that the policy_mapping_fn is distributed and called by each worker. This makes my code not runnable, because I thought it would be called in the main process. Is there any way I can force the policy_mapping_fn to be centralized? @ericl But this might cause the weight updates to be overwritten by other workers, since they share the same policy.

Another way to do this is to predefine an opponent model for each worker. For example, if I train on 8 workers I would have opponent_1 for worker number 1, opponent_2 for worker number 2, and so on, and then do the weight sampling/update on the main process. That way there is no problem with overwriting. I will try to run this tonight.

@51616 ,

"For example, if I train on 8 workers I would have opponent_1 for worker number 1, opponent_2 for worker number 2 and so on."

I might totally missed the point on your objective but would declaring i.e, k environments in each worker where each environment has say, m agents work for you?

Hi @51616 . I haven't tried to implement self-play with so many old policies for the opponent. You may want to try a different approach if you can't get your code working. Using that many old policies implies that you want to train against a larger distribution of policies. The oldest policy would dictate the maximum variance of the action distribution that the new policy competes against. Also, the frequency of the Gating function (see paper above in my post) and the number of old policies would dictate how quickly the updated weights propagate to the older policies. Increasing the number of old policies, to say 50, and updating weights frequently, say every 5 iterations, would suggest that your mapping function will sample from a large population and ultimately reduce the variance in the episode returns from one iteration to the next. In other words, your training will be smooth. Conversely, reducing the number of old policies, to say 5, and reducing the weight update frequency, to say 50, will result in sampling from a smaller population and more variance in your episodic returns. In both cases, you can sample from a population of policies that span back to the same generation. My guess is that there's a happy medium somewhere in between.

@51616 , I think what you're proposing is possible with something like the following:

  • define a single opponent policy
  • in the learner, keep references to a large number of policies on shared storage (such as disk, or ray.put() into the object store)
  • in a callback on the learner (e.g., on_train_result), send each worker the list of references (e.g., using trainer.workers.foreach_worker)
  • in the policy mapping function on the worker, load weights from a random old policy into the opponent policy

This way, only two sets of policy weights need to be in GPU vRAM at the same time.

That said, I would explore trying to keep them in vRAM and seeing how far that can go with perhaps some middle ground as @josjo80 suggests. I believe with typical PPO batch sizes most GPU memory usage will be for storing activations and not the actual weights, so it could fit a surprising number of policies as long as they aren't being used.
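
A minimal sketch of that approach, assuming the two-policy config above ("learning_agent" trainable, "opponent" frozen) and, for simplicity, loading the sampled weights from the driver-side callback rather than from the policy mapping function (the snapshot_refs name is an assumption):

import random
import ray

snapshot_refs = []  # object-store references to "prior self" weights, kept on the driver

def on_train_result(info):
    trainer = info["trainer"]

    # Store the current learning_agent weights in the Ray object store
    # (values only, since variable names differ between the two policies)
    weights = list(trainer.get_policy("learning_agent").get_weights().values())
    snapshot_refs.append(ray.put(weights))

    # Sample one prior self and load it into the "opponent" policy on every
    # rollout worker, re-keying with that worker's own variable names
    sampled = ray.get(random.choice(snapshot_refs))

    def load_opponent(worker):
        opponent = worker.get_policy("opponent")
        keys = opponent.get_weights().keys()
        opponent.set_weights(dict(zip(keys, sampled)))

    trainer.workers.foreach_worker(load_opponent)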

@ericl Thanks for the feedback. What about the exploration behaviour of each policy? Do they share the same exploration parameters?

Hi @josjo80 @ericl I have a similar project, where I want to implement self-play using PPO for a 4 vs 4 game. I notice that your code for the policy mapping fn works for a 1 vs 1 game, but if I try to use it for 4 vs 4, there is a chance that the opposing side will use 4 different policies, since the random choice is called for each agent (which means the agents on the opponent side may end up with different policies). I suppose the ideal situation would be for all the opponent agents to use the same policy chosen by the random choice (CMIIW). Do you have any idea of a workaround for this? Thank you so much!

I think replacing the policy switching mechanism with the on_train_result callback instead of the policy mapping function is a better idea, since on_train_result takes the trainer as input and you can do the weight sync after choosing a proper random opponent.

@Nicholaz99 In the case of mapping a single random policy to a group of agents, I think you may need to implement the team as a group using the with_agent_groups() method: https://docs.ray.io/en/latest/rllib-env.html?highlight=group#grouping-agents . If you have a 4-player team as group_01 and the opposing team as group_02, then you could do something like the following:

# Policy mapping by group/agent id suffix
def policy_mapping_fn(agent_id):
    if agent_id.endswith("_01"):
        return "policy_01"  # choose the 01 policy for the 01 team
    else:
        return np.random.choice(["policy_01", "policy_02", "policy_03", "policy_04"], 1,
                                p=[.8, .2/3, .2/3, .2/3])[0]

I believe this should choose the same policy for everyone in the group. @ericl can you confirm?
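
For reference, a minimal sketch of the grouping idea (YourMultiAgentEnv and the agent ids are hypothetical, and the per-agent spaces are assumed to be exposed as env.observation_space / env.action_space); each group then acts as a single logical agent with a Tuple observation/action space:

from gym.spaces import Tuple
from ray.tune import register_env

grouping = {
    "group_01": ["agent_01", "agent_02", "agent_03", "agent_04"],
    "group_02": ["agent_05", "agent_06", "agent_07", "agent_08"],
}

def make_grouped_env(env_config):
    env = YourMultiAgentEnv(env_config)  # hypothetical custom MultiAgentEnv
    return env.with_agent_groups(
        grouping,
        obs_space=Tuple([env.observation_space] * 4),
        act_space=Tuple([env.action_space] * 4))

register_env("grouped_custom_env", make_grouped_env)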
