If I have a continuous instead of episodic problem, would I have to change anything in RLlib to address it? Or just ensure that my environment always returns done=False?
In chapter 10.3 of Sutton's book, I read that the average reward should be used for continuous problems instead of the discounted return. How would I achieve that? Or do I just have to set the discount factor gamma to 0?
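For context, here is a toy sketch of what I mean by always returning done=False (ContinuingEnv, its spaces, and the reward are just placeholders, not my actual environment):

# Toy sketch (placeholder spaces/rewards): a gym.Env whose step() never
# returns done=True, so the episode never terminates on its own.
import gym
import numpy as np
from gym.spaces import Box, Discrete


class ContinuingEnv(gym.Env):
    def __init__(self, config=None):
        self.observation_space = Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = Discrete(2)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward = float(action)  # dummy reward, stands in for my real reward
        return obs, reward, False, {}  # done is always False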
Great question @stefanbschneider, I haven't thought about this myself (@ericl?). Yes, the PG loss function would even change slightly in this case (in other words, we don't support this atm). It shouldn't be too hard to fix, though. Which algo are you using (some PG, I assume)? For PPO, for example, you could probably change the postprocess_trajectory method to calculate the average (over n timesteps) instead of the sum of rewards, and also make sure that your value function learns to predict the average (not the discounted return until the end of the episode). Gamma should be 1.0 (no, not 0.0!).
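Very roughly, something like the following (just an untested sketch, not something we ship; the function name is made up, the per-batch average is only for illustration, and a running average-reward estimate would be better in practice):

# Sketch: differential (average-reward) returns instead of discounted
# returns, roughly following Sutton & Barto, chapter 10.3. The signature
# follows RLlib's postprocess_trajectory convention.
import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.evaluation.postprocessing import Postprocessing


def postprocess_average_reward(policy, sample_batch,
                               other_agent_batches=None, episode=None):
    rewards = sample_batch[SampleBatch.REWARDS].astype(np.float64)
    vf_preds = sample_batch[SampleBatch.VF_PREDS].astype(np.float64)

    # Average reward estimated from this batch only; in practice you would
    # probably keep a running estimate across batches.
    avg_reward = rewards.mean()

    # G_t = sum_k (R_{t+k+1} - avg_reward), computed backwards over the
    # (truncated) trajectory, with no discounting (gamma = 1.0).
    diff_returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t] - avg_reward
        diff_returns[t] = running

    # Train the value function on the differential return and use the
    # plain (return - value prediction) advantage.
    sample_batch[Postprocessing.VALUE_TARGETS] = diff_returns.astype(np.float32)
    sample_batch[Postprocessing.ADVANTAGES] = (
        diff_returns - vf_preds).astype(np.float32)
    return sample_batch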
I think you can use soft_horizon for this; it's intended for use with true MDP problems.
Thanks for the prompt response!
@sven1977 Yes, I'm using PPO. It would be great if you could provide further details on how to achieve the desired behavior. Not coming from an RL background, I'm a bit overwhelmed and don't know where/how to start.
If I understand Sutton correctly, he argues in chapter 10.4 that no discounting should be used - hence, I considered setting gamma to 0.
@ericl Hm. The docs say:
# Calculate rewards but don't reset the environment when the horizon is
# hit. This allows value estimation and RNN state to span across logical
# episodes denoted by horizon. This only has an effect if horizon != inf.
"soft_horizon": False,
I'm not sure if that really matches my problem since I don't have logical episodes. Or how would I use that? To what value would I set horizon? Wouldn't that be the same as choosing arbitrary episode lengths for my continuous problem?
Btw, I'm using the default MLP model for PPO at the moment; so no RNN.
What would happen if I just always return done = False and otherwise keep my environment and PPO configuration unchanged? Is that generally a bad idea because training/evaluation is somehow tied to episodes?
Did you solve the problem?
I'm not sure. I set done = False and soft_horizon = True and the environment runs continuously without resets as I want it.
So far, I haven't changed anything else, e.g. regarding the calculation of the average reward instead of the discounted return.
The agent still learns reasonable behavior, but it seems to work a bit worse than in the episodic setting.
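For reference, my setup currently looks roughly like this (simplified sketch from memory; ContinuingEnv stands in for my real environment, and I show a finite horizon since soft_horizon only takes effect then, the exact value is arbitrary):

# Simplified version of my current setup (values are placeholders).
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"timesteps_total": 100000},
    config={
        "env": ContinuingEnv,   # always returns done=False in step()
        "horizon": 1000,        # finite horizon so that soft_horizon applies
        "soft_horizon": True,   # keep rolling, don't reset the env
        "gamma": 0.99,          # unchanged, no average-reward correction yet
        "num_workers": 1,
    },
)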
@stefanbschneider Thanks for your answer
Great question @stefanbschneider, I haven't thought about this myself (@ericl?). Yes, the PG loss function would even change slightly in this case (in other words, we don't support this atm). It shouldn't be too hard to fix, though. Which algo are you using (some PG, I assume)? For PPO, for example, you could probably change the postprocess_trajectory method to calculate the average (over n timesteps) instead of the sum of rewards, and also make sure that your value function learns to predict the average (not the discounted return until the end of the episode). Gamma should be 1.0 (no, not 0.0!).
@sven1977 or @ericl
Is there an example of how to use the postprocess_trajectory() method? The discounted reward may or may not work depending on how close gamma is to 1 and how long the continuing trajectory is. It would therefore be really appreciated to see the average reward and a value function based on it.
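For concreteness, I imagine the wiring would look something like this (untested sketch; the exact import paths and whether with_updates() is available depend on the RLlib version, and postprocess_average_reward is the hypothetical function sketched above):

# Untested sketch: plug a custom postprocess_fn into PPO via with_updates().
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

# Policy that computes average-reward (differential) returns in
# postprocessing instead of discounted returns.
AvgRewardPPOPolicy = PPOTFPolicy.with_updates(
    name="AvgRewardPPOPolicy",
    postprocess_fn=postprocess_average_reward,
)

# Trainer that uses the modified policy by default.
AvgRewardPPOTrainer = PPOTrainer.with_updates(
    name="AvgRewardPPO",
    default_policy=AvgRewardPPOPolicy,
)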