Some algorithms, such as R-Max, need to know the maximum reward the environment can give at any _single step_. Continuous-state environments already provide a range for each observation dimension, so I suggest they provide a reward range too. Note that this means the _single-step_ reward range, not the _accumulated reward_ range.
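To illustrate why the single-step bound is what matters, here is a minimal sketch of how an R-Max-style agent might use it. The function names are hypothetical. The sketch assumes a discounted setting and uses a common simplification: every (state, action) pair starts at the optimistic value r_max / (1 - gamma), which is the largest discounted return possible when no single step pays more than r_max.

```python
def rmax_optimistic_value(r_max: float, gamma: float) -> float:
    """Upper bound on discounted return when no single step pays more than r_max.

    Summing r_max * gamma**t over t = 0, 1, 2, ... gives r_max / (1 - gamma).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return r_max / (1.0 - gamma)


def init_rmax_q_table(n_states: int, n_actions: int, r_max: float, gamma: float):
    """Hypothetical helper: start every (state, action) pair at the optimistic
    bound, as R-Max effectively does for pairs that are not yet 'known'."""
    v_max = rmax_optimistic_value(r_max, gamma)
    # Build each row separately so rows do not share the same list object.
    return [[v_max] * n_actions for _ in range(n_states)]


if __name__ == "__main__":
    q = init_rmax_q_table(n_states=3, n_actions=2, r_max=1.0, gamma=0.5)
    print(q)  # every entry is 1.0 / (1 - 0.5) = 2.0
```

The accumulated bound is derived from the single-step bound, which is why the environment only needs to declare the single-step range.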
Not a bad idea.
Would it make sense to expose that as env.reward_space? That would follow suit with env.action_space and env.observation_space, which give shape and bound data for actions and observations.
Yep, that sounds right to me.
John, what do you think of using a 1-D Box, which we extend to also allow inf and -inf as bounds?
I'm more in favor of calling it reward_range, which defaults to (-np.inf, np.inf)
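A minimal sketch of the reward_range proposal, assuming it is a plain class attribute holding a (low, high) tuple of scalars. The Env and CoinFlipEnv classes here are hypothetical stand-ins, not any library's actual classes, and step() returns a simplified Gym-style (observation, reward, done, info) tuple.

```python
import numpy as np


class Env:
    # Proposed default: no declared bound on the single-step reward.
    reward_range = (-np.inf, np.inf)


class CoinFlipEnv(Env):
    # Hypothetical environment whose every step pays either -1 or +1,
    # so it overrides the default with a tighter scalar range.
    reward_range = (-1.0, 1.0)

    def __init__(self, seed=None):
        self._rng = np.random.default_rng(seed)

    def step(self, action):
        reward = 1.0 if self._rng.random() < 0.5 else -1.0
        low, high = self.reward_range
        # Sanity check: the declared range must actually hold.
        assert low <= reward <= high
        return None, reward, False, {}


if __name__ == "__main__":
    env = CoinFlipEnv(seed=0)
    print(Env.reward_range, env.reward_range)
```

Using a (low, high) tuple rather than a 1-D Box keeps it clear that the reward is always a scalar.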
Ok yeah, that sounds better. Don't think we want to imply that it can be more than a scalar.
Do you want those just to be bounds, or could it be useful to know that your rewards will be e.g. a discrete set like {-1, 1}?
I can't think of any reason why one would care that it's a discrete set.
Note that the Atari environments (and perhaps others) are still at the default (-inf, inf) range and do not declare the specific range of each ROM.
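Because some environments keep the default (-inf, inf) range, an algorithm that needs a finite maximum has to detect that case. A minimal sketch follows; max_step_reward and its fallback parameter are hypothetical names, and the fallback assumes the user supplies a bound they know to be valid.

```python
import math


def max_step_reward(reward_range, fallback=None):
    """Return the declared single-step reward upper bound.

    If the environment left the upper bound infinite (the default),
    use a caller-supplied fallback, or fail loudly if there is none.
    """
    low, high = reward_range
    if math.isfinite(high):
        return float(high)
    if fallback is not None:
        return float(fallback)
    raise ValueError("environment does not declare a finite maximum single-step reward")


if __name__ == "__main__":
    print(max_step_reward((-1.0, 1.0)))
    print(max_step_reward((-math.inf, math.inf), fallback=5))
```

Only the upper bound is checked, since that is the quantity R-Max-style optimism needs.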