Ml-agents: gym_unity seems to provide a reward of 0.0 for the final step

Created on 17 Feb 2020 · 3 Comments · Source: Unity-Technologies/ml-agents

When using gym_unity to interact with a built Unity environment, the reward returned by env.step(action) at the last timestep (when done = True) seems to be 0.
I tried this on the Basic and 3DBall environments, both of which should produce a nonzero reward at the last timestep.

To reproduce the bug:

  1. Build the 'Basic' environment in Unity
  2. Run the following code :
import numpy as np
from gym_unity.envs import UnityEnv

env = UnityEnv("../../../ml-agents/envs/buildbasic/Basic", 5, no_graphics=False, flatten_branched=False)

for e in range(2):
    print("Episode ", e)
    o, d = env.reset(), False

    while not d:
        o, r, d, _ = env.step(np.array([2]))
        print(o, r, d)

env.close()
  3. Output:
Episode  0
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 0.0 True
Episode  1
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 0.0 True

The agent should receive a reward of +1 at the last timestep for reaching the big goal (here it always takes action 2, which corresponds to going to the left). Instead, it receives a reward of 0 at the last timestep.
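The check described above can be sketched as a small helper that records every reward of an episode and flags a terminal step whose reward is exactly 0.0. This is only an illustration: `FakeBasicEnv` is a hypothetical stand-in that imitates the rewards shown in this report (it is not part of gym_unity), and `episode_rewards` and `final_reward_missing` are names made up for the sketch. The helper only assumes the gym-style `reset()` and 4-tuple `step()` interface used in the script above.

```python
from typing import Any, List, Tuple


class FakeBasicEnv:
    """Hypothetical stand-in for the Basic build (NOT the gym_unity API).

    Imitates the rewards in this report: -0.01 on every non-terminal step
    and 0.99 on the step that reaches the goal.
    """

    def __init__(self, steps_to_goal: int = 7, drop_final_reward: bool = False):
        self.steps_to_goal = steps_to_goal
        self.drop_final_reward = drop_final_reward  # simulates the reported bug
        self._t = 0

    def reset(self) -> int:
        self._t = 0
        return self._t

    def step(self, action: Any) -> Tuple[int, float, bool, dict]:
        self._t += 1
        done = self._t >= self.steps_to_goal
        if done:
            reward = 0.0 if self.drop_final_reward else 0.99
        else:
            reward = -0.01
        return self._t, reward, done, {}


def episode_rewards(env, action) -> List[float]:
    """Run one episode with a fixed action and return every step's reward."""
    env.reset()
    rewards, done = [], False
    while not done:
        _, reward, done, _ = env.step(action)
        rewards.append(reward)
    return rewards


def final_reward_missing(rewards: List[float]) -> bool:
    """True when the terminal step carries exactly 0.0 reward."""
    return rewards[-1] == 0.0


healthy = episode_rewards(FakeBasicEnv(), action=2)
buggy = episode_rewards(FakeBasicEnv(drop_final_reward=True), action=2)
print(final_reward_missing(healthy), final_reward_missing(buggy))
```

Passing a real gym-style environment to `episode_rewards` instead of the fake one should work the same way, provided its `step()` returns the same 4-tuple.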

Environment:

  • OS + version: Ubuntu 18.04.4
  • _ML-Agents version_: ML-Agents v0.14.0
  • _Environment_: Basic, 3DBall

Thank you.

bug

All 3 comments

It seems to be a problem specific to gym_unity, as the reward is not 0 when interacting with the environment through mlagents_envs.
Code :

import matplotlib.pyplot as plt
import numpy as np
import sys

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfig, EngineConfigurationChannel

engine_configuration_channel = EngineConfigurationChannel()
env = UnityEnvironment(base_port = UnityEnvironment.DEFAULT_EDITOR_PORT, worker_id=7, file_name="../../../ml-agents/envs/buildbasic/Basic", side_channels = [engine_configuration_channel])

#Reset the environment
env.reset()

# Set the default brain to work with
group_name = env.get_agent_groups()[0]
group_spec = env.get_agent_group_spec(group_name)

# Set the time scale of the engine
engine_configuration_channel.set_configuration_parameters(time_scale = 1.0)

for episode in range(2):
    env.reset()
    step_result = env.get_step_result(group_name)
    done = False
    episode_rewards = 0
    while not done:
        action = np.array([2]).reshape(1, 1)

        env.set_actions(group_name, action)
        env.step()
        step_result = env.get_step_result(group_name)

        episode_rewards += step_result.reward[0]
        done = step_result.done[0]
        print(done, step_result.reward[0])

    print("Total reward this episode: {}".format(episode_rewards))
env.close()

Output :

False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
True 0.99
Total reward this episode: 0.9300000108778477
False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
True 0.99
Total reward this episode: 0.9300000108778477

As you can see, we observe the +1 reward at the last timestep (0.99 exactly, because we also receive -0.01 per timestep).
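The numbers in that output can be checked with a few lines of arithmetic. The sketch below assumes a goal bonus of +1 and a per-step penalty of -0.01, as stated in this thread, and additionally assumes that the rewards arrive as single-precision floats, which would explain why the printed total is 0.9300000108778477 rather than 0.93. The variable names and the `as_float32` helper are illustrative only.

```python
import math
import struct

STEP_PENALTY = -0.01  # per-step reward seen in the output above
GOAL_REWARD = 1.0     # bonus for the big goal, as stated in the report
N_STEPS = 7           # steps per episode in the output above

# Reward on the terminal step: goal bonus plus the usual step penalty.
final_step = GOAL_REWARD + STEP_PENALTY

# Six ordinary steps followed by the terminal step.
episode_total = (N_STEPS - 1) * STEP_PENALTY + final_step


def as_float32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Assumption: rewards are float32 values accumulated in double precision.
total_from_float32 = (N_STEPS - 1) * as_float32(STEP_PENALTY) + as_float32(final_step)

print(round(final_step, 2), round(episode_total, 2))
```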

Hi @Procuste34 -- thanks for the bug report and repro steps. I'll share this issue with the team.

This issue was fixed in #3471, referenced here. You can try it out in the latest v0.14.1 release. I'm going to close this issue report, but please feel free to reopen if you continue to have problems.
