Ray: [ray][rllib] How to use PBT in multi-agent training?

Created on 15 May 2020 · 11 Comments · Source: ray-project/ray

In the game there are 4 agents divided into 2 groups of 2.
I have 2 groups playing against each other. I would like PBT to maximize the "policy_reward_mean" metric of each group/agent, and I want each team to get its own hyper-parameter tuning.
For example, if the Blue team mostly wins, I would expect PBT to lower the entropy_coeff for the Blue team but increase it for the Red team.

How can this be done?
Can you show me some example code, please?
I thought about maybe putting the PBT scheduler in the policy config.
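For context, this is roughly the single-population setup I have in mind (just a sketch; the env name and mutation values are placeholders), but as far as I can tell it only tunes one shared config rather than one per team:

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",   # ideally this would be policy_reward_mean per team
    mode="max",
    perturbation_interval=5,
    hyperparam_mutations={"entropy_coeff": [0.001, 0.01, 0.1]},
)

tune.run(
    "PPO",
    scheduler=pbt,
    num_samples=4,
    config={
        "env": "my_team_game",      # placeholder for my 2-vs-2 env
        "entropy_coeff": 0.01,      # PBT perturbs this globally, not per team
        # "multiagent": {...}       # blue/red policies + policy_mapping_fn go here
    },
)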

Thank you.

question

All 11 comments

I started with RLlib recently, so I am not an expert. From my understanding, you have to implement your own custom Trainable class that wraps the RL agent (the trainer object).

Updating the parameters happens in the reset_config method of your Trainable class. Here you have to dig a little bit into the code and see where the parameters that you want to perturb live. A hot candidate is probably trainer.workers.local_worker().policy_map, which is a dictionary that holds all policies that you are training.

How you update the parameters depends on the parameter itself. If it is just a scalar value that is used throughout RLlib, then you can probably simply assign the new value. If the parameter is passed into the compute graph of the policy (e.g., the learning rate of Adam), then you have to update the respective variable in the graph, which, of course, depends on the framework you are using.

In the case of TensorFlow, it is important that the corresponding value is implemented as a variable and not a tensor. From my understanding, you cannot update a tensor, but a variable works. So if the parameter you want to update is not a variable, you might have to tweak the creation process of the policy.
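As a toy illustration of the variable-vs-tensor point (TF1-style; the names here are made up):

import tensorflow as tf

# A non-trainable variable can be overwritten in-place via .load() ...
lr_var = tf.get_variable("cur_lr", initializer=1e-3, trainable=False)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    lr_var.load(5e-4, session=sess)   # works: the graph now uses the new value
# ... whereas a plain tensor such as tf.constant(1e-3) has no .load()
# and cannot be changed after the graph is built.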

Here is an example of how I implemented a Trainable that wraps a DQN trainer and updates its learning rate (no guarantee, though):

from ray.tune import Trainable


class RlTrainable(Trainable):

    def _setup(self, config):
        # set_up_trainer() is my own helper that builds the RLlib trainer (a DQNTrainer).
        self.trainer = set_up_trainer()

    def _train(self):
        return self.trainer.train()

    def _save(self, tmp_checkpoint_dir):
        return self.trainer._save(tmp_checkpoint_dir)

    def _restore(self, checkpoint):
        self.trainer._restore(checkpoint)

    def export_model(self, export_formats, export_dir=None):
        self.trainer.export_model(export_formats, export_dir)

    def reset_config(self, new_config):
        # Called by Tune when PBT perturbs the config (with reuse_actors=True).
        if "lr" in new_config:
            # The learning rate lives as a TF variable inside the policy,
            # so load the new value into the graph through the policy's session.
            cur_lr = self.trainer.workers.local_worker().policy_map['all-jobs'].cur_lr
            sess = self.trainer.workers.local_worker().policy_map['all-jobs']._sess
            cur_lr.load(
                new_config['lr'],
                session=sess
            )
        self.config = new_config
        return True
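Roughly, I then hand this Trainable to Tune's PBT scheduler like so (again just a sketch; the metric name and mutation range are placeholders, and I believe reuse_actors=True is needed for reset_config to actually be called):

import numpy as np
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=5,
    hyperparam_mutations={"lr": lambda: np.random.uniform(1e-5, 1e-3)},
)

tune.run(
    RlTrainable,
    scheduler=pbt,
    num_samples=4,            # population size
    config={"lr": 1e-4},      # perturbed values arrive in reset_config above
    reuse_actors=True,
)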

I hope it helps you.

Hi @omri-axon-vision, @PaddyK, I'm new to ray as well. I managed to find a Humanoid-v1 example but I'm not sure if the current PBT scheduler will work with a MARL environment. Can someone please confirm this? Thanks!

As pointed out by @PaddyK, access to a trainer object grants access to the agent's policies, and that's where one can easily do weight syncing or access its hyperparameters. Besides implementing your own Trainable, I think another option is to access the trainer in the on_train_result callback (line 152).
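With the (legacy) dict-style callback API, that would look roughly like this (a sketch; I'm assuming the info dict carries the trainer and the latest result, and the trigger and mutation logic are placeholders):

def on_train_result(info):
    trainer = info["trainer"]
    result = info["result"]
    # Placeholder trigger: run a PBT exploit/explore step every 10 iterations.
    if result["training_iteration"] % 10 == 0:
        for pid, policy in trainer.workers.local_worker().policy_map.items():
            pass  # compare team rewards here and mutate each policy's hparams

config = {
    # ... the usual trainer/multiagent config ...
    "callbacks": {"on_train_result": on_train_result},
}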

As I'm unsure if the current PBT scheduler will work with a MARL environment, I decided to try implementing a really bare bones version referencing this paper. If anyone is interested, here's my repo.

Hope that helps.

@ChuaCheowHuan I don't know if your implementation works in RLlib, since your PBT_MARL class changes the config of the trainer, but as I understand it the loss-calculation graph is defined up front when the trainer is initialized. Hence, changing the config doesn't really change the loss (e.g., learning rate, gamma, vf_coeff, clip_value). @ericl, can you verify this?

I'm also trying to implement PBT in MARL. One solution I can think of is having an individual trainer for each policy in the population and re-initializing the trainer every time it inherits and mutates the current hyperparameters. This may not be optimal, since it requires additional space and time to keep each trainer in memory, and re-initializing a trainer might be expensive. Any thoughts?

Hi @51616, You are right. I could have overlooked that. Your proposed solution sounds good. Thanks for having a look at my code.

EDIT: The issue has been fixed in my repo.

@ChuaCheowHuan Actually, what I suggested probably doesn't work, since we need the agents to play together, which, I believe, requires the policies to be in the same trainer instance. I don't know if there is a way to use the same experience to train different policies with different hyperparameters in RLlib.

Hi @51616, I just read the first comment of this somewhat different but related issue #5753.

You have to create the set of valid policies up-front at trainer initialization. The configuration dict is replicated in the cluster so it's not something you can just mutate in your code.

So I guess this validates your concern about changing the config.

If you really need to change the configuration at runtime, a possible workaround is to checkpoint the trainer with save(), destroy it, create a new trainer with the updated configuration, and restore the checkpoint with restore().

I would like to try the checkpointing workaround (though it's mentioned that this would be rather expensive) with a single trainer (for multiple policies) using the multiagent dictionary.

I would appreciate it if anyone could shed some light on this. Thank you.

@51616 Following the suggestions in #5609 #8827 #9012, it seems possible to change hyperparameters during runtime.

Possible solution:

# In the training loop:
new_config = trainer.get_config()
new_config["lr"] = 1e-4   # make some changes, e.g. mutate the learning rate
save_path = trainer.save(local_dir)
trainer.stop()
trainer = ppo.PPOTrainer(config=new_config, env="RockPaperScissorsEnv")
trainer.restore(save_path)

If anyone's interested, I've pieced together some simple test code, available in this gist.
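The basic idea is to wrap the snippet above in a periodic perturb-and-rebuild step, something along these lines (a sketch; the trigger and mutation logic are placeholders):

import numpy as np
from ray.rllib.agents import ppo

def maybe_perturb(trainer, iteration, local_dir, perturb_interval=10):
    # Every `perturb_interval` iterations, mutate the config and rebuild
    # the trainer from a checkpoint so the new config actually takes effect.
    if iteration % perturb_interval != 0:
        return trainer
    new_config = trainer.get_config()
    new_config["lr"] *= np.random.choice([0.8, 1.2])   # toy mutation
    save_path = trainer.save(local_dir)
    trainer.stop()
    trainer = ppo.PPOTrainer(config=new_config, env="RockPaperScissorsEnv")
    trainer.restore(save_path)
    return trainer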

@ChuaCheowHuan Another approach is to make use of a mixin class, but this requires modifying the policy. I believe this is a more efficient way to implement PBT in MARL, since you don't have to re-initialize the trainer every time, which can be very expensive, especially when there are multiple policies in the trainer.

import tensorflow as tf

from ray.rllib.agents.ppo import PPOTrainer, PPOTFPolicy


class PBTParamsMixin:
    def __init__(self, obs_space, action_space, config):
        # Keep the python-side value and a non-trainable TF variable in sync.
        self.cur_lr_val = config['lr']
        self.cur_lr = tf.get_variable(
            initializer=tf.constant_initializer(self.cur_lr_val),
            name='cur_lr', shape=(), trainable=False, dtype=tf.float32)

    def update_val(self, update_dict):
        out_dict = {}
        for key, value in update_dict.items():
            # Keys are expected to end with '_val', e.g. {'cur_lr_val': 0.001}.
            if key[-4:] == '_val' and key in self.__dict__:
                setattr(self, key, value)
                out_dict[key] = getattr(self, key)
                # Load the new value into the corresponding TF variable.
                self.__dict__[key[:-4]].load(
                    getattr(self, key), session=self.get_session())
            else:
                raise KeyError(f'Unknown update key: {key} is not in this policy class')
        return out_dict


def setup_mixins(policy, obs_space, action_space, config):
    PBTParamsMixin.__init__(policy, obs_space, action_space, config)


CustomPolicy = PPOTFPolicy.with_updates(
    name="MyCustomPPOTFPolicy",
    before_init=setup_mixins,
    mixins=[PBTParamsMixin])

CustomTrainer = PPOTrainer.with_updates(
    default_policy=CustomPolicy)

Then, in a callback, you can do something like this:

# Note: get_policy() takes the policy id used in the multiagent config.
policy = trainer.workers.local_worker().get_policy('MyCustomPPOTFPolicy')
update_dict = {'cur_lr_val': 0.001}   # keys must end with '_val' (see the mixin above)
policy.update_val(update_dict)
# Push the same update to that policy on every rollout worker:
trainer.workers.foreach_worker(
    lambda w: w.get_policy('MyCustomPPOTFPolicy').update_val(update_dict))

@51616 Nice, I like your suggested approach.

https://github.com/ray-project/ray/blob/b4c527b3f3e61bbea72b15706a9958fd337c93a6/rllib/examples/centralized_critic.py#L177

The example above modifies a policy with the following mixin, changing the learning rate.

https://github.com/ray-project/ray/blob/b4c527b3f3e61bbea72b15706a9958fd337c93a6/rllib/policy/tf_policy.py#L814

I'll be trying your suggestion. Thank you.

@51616

The following two lines help me change the lr hyperparameter during runtime without needing to modify the policy with a mixin.
In fact, the required schedule mixins are already included in PPO. Related: #9872 #9929

# Changing lr to a random number during runtime.
import numpy as np
from ray.rllib.utils.schedules import ConstantSchedule

agt_0_pol = trainer.get_policy('agt_0')
agt_0_pol.lr_schedule = ConstantSchedule(np.random.rand(), framework=None)
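And to give a whole team the same treatment, the same two lines can simply be looped over the team's policy ids ('agt_0' and 'agt_1' are the ids from my own setup, so adjust as needed):

# Apply one sampled lr to every policy in the team.
new_lr = np.random.rand()
for pid in ['agt_0', 'agt_1']:
    pol = trainer.get_policy(pid)
    pol.lr_schedule = ConstantSchedule(new_lr, framework=None)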

@ChuaCheowHuan Yeah, but modifying the mixin gives you more freedom to change other hyperparams too. Your code is more convenient if the only hparam being changed is the learning rate.
