Ray: rllib: Using gym.RewardWrapper around MultiAgentEnv causes observation mismatch with observation_space

Created on 4 May 2020 · 3 comments · Source: ray-project/ray

system config

  • Python version: 3.6.2
  • Ray version: 0.8.4
  • TensorFlow version: 1.14.0
  • OS: CentOS 7
  • gym version: 0.17.1

problem

I was trying to customize my multi-agent env with some reward shaping using gym.RewardWrapper, and it gives me this error:

ValueError: ('Observation outside expected value range', Tuple(Discrete(2), Discrete(2)), {'agent_0': (0, 0), 'agent_1': (0, 0)})

Here is a minimal example

import numpy as np
import gym
from gym.spaces import Discrete, Tuple, Dict
import ray
from ray.rllib.env import MultiAgentEnv
from ray.tune.registry import register_env
import ray.rllib.agents.a3c as a3c
config = {
    'env_config': {
        'num_agents': 2,
    },
    'env': 'NPD',
    'num_workers': 2,
    'use_pytorch': False,
    'train_batch_size': 200,
    'rollout_fragment_length': 200,
    'lr': 0.0001
}
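# Toy multi-agent env: each agent observes the tuple of all agents' previous
# actions (zeros on reset), receives a random reward each step, and the
# episode ends after max_steps.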
class Example(MultiAgentEnv):

    def __init__(self, num_agents=2, max_steps=100):

        super(Example, self).__init__()
        self.reward_range = (-np.inf, np.inf)
        self.metadata = {'render.modes': []}
        self.num_agents = num_agents
        self.players = [ 'agent_' + str(i) for i in range(num_agents)]
        self.action_space = Discrete(2)
        self.observation_space = Tuple(tuple(Discrete(2) for _ in range(num_agents)))
        self.current_step = None
        self.max_steps = max_steps

    def reset(self):
        self.current_step = 0
        observation = [tuple(0 for _ in range(self.num_agents)) for _ in range(self.num_agents)]
        return dict(zip(self.players, observation))

    def step(self, actions_dict):
        actions = np.array(list(actions_dict.values())).flatten()
        reward = np.array([np.random.random() for _ in range(self.num_agents)])
        reward = dict(zip(self.players, reward))
        observation = dict(zip(self.players, [tuple(actions) for i in range(self.num_agents)]))
        self.current_step += 1
        done = {'__all__': self.current_step == self.max_steps}

        info = dict(zip(self.players, [{}]*self.num_agents))
        return observation, reward, done, info
class Rwrapper(gym.RewardWrapper):
    # Pass-through reward wrapper: returns the reward unchanged.
    def __init__(self, env):
        self.reward_range = env.reward_range
        super(Rwrapper, self).__init__(env)

    def reward(self, reward):
        return reward
register_env('NPD', lambda env_config: Rwrapper(Example(**env_config)))
ray.init(num_cpus=4)
trainer = a3c.A3CTrainer(config = config, env = 'NPD')
trainer.train()

Running the above code block gives me the ValueError quoted above.

And if I substitute the register_env line with the following line (getting rid of the reward wrapper):

register_env('NPD', lambda env_config: Example(**env_config))

The code executes successfully.
I wonder if anyone can help me with the issue
Thank you very much

Labels: bug, rllib

All 3 comments

@0luhancheng0 thanks for filing this issue.
I found the problem. When you wrap your Env, our code does not recognize it as a MultiAgentEnv anymore and screws up the observation, which is then not recognized by the preprocessor. Will fix it and PR later today.
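
For context, a minimal sketch of what that comment describes, assuming the Example and Rwrapper classes from the reproduction script above; the isinstance framing is an interpretation of the failure mode, not RLlib's actual internal code:

from ray.rllib.env import MultiAgentEnv

# The gym.Wrapper subclass is not a MultiAgentEnv, so a type check like this
# fails and the dict observation gets routed through the single-agent
# preprocessor, where it no longer matches the Tuple observation_space.
wrapped = Rwrapper(Example(num_agents=2))
print(isinstance(wrapped, MultiAgentEnv))      # False: the wrapper hides the multi-agent type
print(isinstance(wrapped.env, MultiAgentEnv))  # True: the underlying env is multi-agent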

Here is the PR that fixes the problem (the test with your script passed). I'm closing this issue, but feel free to re-open it should the problem persist on your end.
Thanks again for letting us know about this!

https://github.com/ray-project/ray/pull/8314

Hey @0luhancheng0 . I'm very sorry, but we have decided that we don't want to add support for wrapping MultiAgentEnvs (which are not gym.Env children) with gym.Wrappers. In order to handle reward shaping for our MultiAgentEnvs, could you simply write your own custom wrapper or post processing function that takes the rewards and manipulates them?
Again, sorry for the inconvenience, but we don't want to add support for something that would become harder to maintain at some point (b/c the MultiAgentEnv API may evolve over time and thus away from the gym.Env API).
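
A minimal sketch of that suggestion, assuming the Example env from the reproduction script above; the wrapper name MultiAgentRewardShaping and the identity shaping rule are placeholders, not an RLlib API:

from ray.rllib.env import MultiAgentEnv
from ray.tune.registry import register_env

class MultiAgentRewardShaping(MultiAgentEnv):
    # Delegating wrapper that stays a MultiAgentEnv, so RLlib keeps treating
    # the env (and its dict observations/rewards) as multi-agent.
    def __init__(self, env):
        self.env = env
        self.action_space = env.action_space
        self.observation_space = env.observation_space

    def reset(self):
        return self.env.reset()

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)
        # Shape each agent's reward here; identity shaping as a placeholder.
        shaped_rewards = {agent_id: r for agent_id, r in rewards.items()}
        return obs, shaped_rewards, dones, infos

register_env('NPD', lambda env_config: MultiAgentRewardShaping(Example(**env_config)))

Because this wrapper subclasses MultiAgentEnv directly instead of gym.RewardWrapper, RLlib should still recognize the env as multi-agent and the per-agent dict observations should pass the preprocessor check.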
