When using gym_unity to interact with a built Unity environment, the reward returned by env.step(action) at the last timestep (when done = True) appears to always be 0.
I tried this on the Basic and 3DBall environments, both of which should produce a nonzero reward at the last timestep.
import numpy as np
from gym_unity.envs import UnityEnv

# Second positional argument is the worker_id
env = UnityEnv("../../../ml-agents/envs/buildbasic/Basic", 5, no_graphics=False, flatten_branched=False)

for e in range(2):
    print("Episode ", e)
    o, d = env.reset(), False
    while not d:
        # Always take action 2
        o, r, d, _ = env.step(np.array([2]))
        print(o, r, d)

env.close()

Output:
Episode 0
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 0.0 True
Episode 1
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] -0.01 False
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 0.0 True
The agent should receive a reward of +1 at the last timestep for reaching the big goal (it always takes action 2, which corresponds to going to the left), but as shown above it receives a reward of 0 instead.
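To quantify the missing terminal reward, here is a minimal variant of the loop above (same build path and worker_id) that accumulates the per-episode return; with the +1 dropped it prints roughly -0.06 per episode, whereas mlagents_envs reports 0.93 for the same policy (see below):

import numpy as np
from gym_unity.envs import UnityEnv

env = UnityEnv("../../../ml-agents/envs/buildbasic/Basic", 5, no_graphics=False, flatten_branched=False)
for e in range(2):
    o, d, total = env.reset(), False, 0.0
    while not d:
        o, r, d, _ = env.step(np.array([2]))
        total += r
    # Expected 0.93 per episode; observed roughly -0.06 because the
    # terminal step returns 0.0 instead of the goal reward.
    print("Episode", e, "return:", total)
env.close()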
Thank you.
It seems to be a problem specific to gym_unity, as the reward is not 0 when interacting with the environment through mlagents_envs.
Code:
import numpy as np

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

engine_configuration_channel = EngineConfigurationChannel()
env = UnityEnvironment(base_port=UnityEnvironment.DEFAULT_EDITOR_PORT, worker_id=7, file_name="../../../ml-agents/envs/buildbasic/Basic", side_channels=[engine_configuration_channel])

# Reset the environment
env.reset()

# Get the default agent group to work with
group_name = env.get_agent_groups()[0]
group_spec = env.get_agent_group_spec(group_name)

# Set the time scale of the engine
engine_configuration_channel.set_configuration_parameters(time_scale=1.0)

for episode in range(2):
    env.reset()
    step_result = env.get_step_result(group_name)
    done = False
    episode_rewards = 0
    while not done:
        # Always take action 2, as in the gym_unity repro
        action = np.array([2]).reshape(1, 1)
        env.set_actions(group_name, action)
        env.step()
        step_result = env.get_step_result(group_name)
        episode_rewards += step_result.reward[0]
        done = step_result.done[0]
        print(done, step_result.reward[0])
    print("Total reward this episode: {}".format(episode_rewards))

env.close()
Output:
False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
True 0.99
Total reward this episode: 0.9300000108778477
False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
False -0.01
True 0.99
Total reward this episode: 0.9300000108778477
As you can see, with mlagents_envs we observe the goal reward at the last timestep: 0.99 exactly, i.e. the +1 goal reward minus the -0.01 per-timestep penalty applied on that same step.
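As a quick sanity check, the printed total also works out arithmetically:

# Six intermediate steps at -0.01 each, plus a final step carrying the +1 goal
# reward together with its own -0.01 penalty, give the 0.93 printed above.
expected_total = 6 * -0.01 + (1.0 - 0.01)
print(expected_total)  # 0.93 (up to float32 noise, hence 0.9300000108...)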
Hi @Procuste34 -- thanks for the bug report and repro steps. I'll share this issue with the team.
This issue was fixed in #3471. You can try it out in the latest v0.14.1 release. I'm going to close this issue report, but please feel free to reopen if you continue to have problems.
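For anyone verifying the fix later, a minimal sketch (assuming the gym-unity package version tracks the ml-agents release, e.g. pip install gym-unity==0.14.1, and reusing the build path from the repro above):

import numpy as np
from gym_unity.envs import UnityEnv

# After upgrading, rerun the repro and check that the terminal step now
# carries the goal reward instead of 0.0.
env = UnityEnv("../../../ml-agents/envs/buildbasic/Basic", 5, no_graphics=False, flatten_branched=False)
o, d, r = env.reset(), False, 0.0
while not d:
    o, r, d, _ = env.step(np.array([2]))
# With the fix, the final reward should be 0.99 (+1 goal reward minus the
# -0.01 step penalty), not 0.0.
assert r != 0.0
env.close()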