If I have a continuous instead of episodic problem, would I have to change anything in RLlib to address it? Or just ensure that my environment always returns done=False?
In chapter 10.3 of Sutton's book, I read that the average reward should be used for continuous problems instead of the discounted return. How would I achieve that? Or do I just have to set the discount factor gamma to 0?
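For context, here is a toy sketch of what I mean by always returning done=False (ContinuingEnv, its spaces, and the reward are just placeholders, not my actual environment):

# Toy sketch (placeholder spaces/rewards): a gym.Env whose step() never
# returns done=True, so the episode never terminates on its own.
import gym
import numpy as np
from gym.spaces import Box, Discrete


class ContinuingEnv(gym.Env):
    def __init__(self, config=None):
        self.observation_space = Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = Discrete(2)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward = float(action)  # dummy reward, stands in for my real reward
        return obs, reward, False, {}  # done is always False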
Great question @stefanbschneider, I haven't thought about this myself (@ericl?). Yes, the PG loss function would even change slightly in this case (in other words, we don't support this atm). It shouldn't be too hard to fix, though. Which algo are you using (some PG, I assume)? For PPO, for example, you could probably change the postprocess_trajectory method to calculate the average (over n timesteps) instead of the sum of rewards, and also make sure that your value function learns to predict the average (not the discounted return until the end of the episode). Gamma should be 1.0 (no, not 0.0!).
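Very roughly, something like the following (just an untested sketch, not something we ship; the function name is made up, the per-batch average is only for illustration, and a running average-reward estimate would be better in practice):

# Sketch: differential (average-reward) returns instead of discounted
# returns, roughly following Sutton & Barto, chapter 10.3. The signature
# follows RLlib's postprocess_trajectory convention.
import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.evaluation.postprocessing import Postprocessing


def postprocess_average_reward(policy, sample_batch,
                               other_agent_batches=None, episode=None):
    rewards = sample_batch[SampleBatch.REWARDS].astype(np.float64)
    vf_preds = sample_batch[SampleBatch.VF_PREDS].astype(np.float64)

    # Average reward estimated from this batch only; in practice you would
    # probably keep a running estimate across batches.
    avg_reward = rewards.mean()

    # G_t = sum_k (R_{t+k+1} - avg_reward), computed backwards over the
    # (truncated) trajectory, with no discounting (gamma = 1.0).
    diff_returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t] - avg_reward
        diff_returns[t] = running

    # Train the value function on the differential return and use the
    # plain (return - value prediction) advantage.
    sample_batch[Postprocessing.VALUE_TARGETS] = diff_returns.astype(np.float32)
    sample_batch[Postprocessing.ADVANTAGES] = (
        diff_returns - vf_preds).astype(np.float32)
    return sample_batch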
I think you can use soft_horizon for this; it's intended for use with true MDP problems.
Thanks for the prompt response!
@sven1977 Yes, I'm using PPO. It would be great if you could provide further details on how to achieve the desired behavior. Not coming from an RL background, I'm a bit overwhelmed and don't know where/how to start.
If I understand Sutton correctly, he argues in chapter 10.4 that no discounting should be used - hence, I considered setting gamma to 0.
@ericl Hm. The docs say:
# Calculate rewards but don't reset the environment when the horizon is
# hit. This allows value estimation and RNN state to span across logical
# episodes denoted by horizon. This only has an effect if horizon != inf.
"soft_horizon": False,
I'm not sure if that really matches my problem since I don't have logical episodes. Or how would I use that? To what value would I set horizon? Wouldn't that be the same as choosing arbitrary episode lengths for my continuous problem?
Btw, I'm using the default MLP model for PPO at the moment; so no RNN.
What would happen if I just always return done = False and otherwise keep my environment and PPO configuration unchanged? Is that generally a bad idea because training/evaluation is somehow tied to episodes?
Did you solve the problem?
I'm not sure. I set done = False and soft_horizon = True and the environment runs continuously without resets as I want it.
So far, I haven't changed anything else, e.g. regarding the calculation of the average reward instead of the discounted return.
The agent still learns reasonable behavior, but it seems to work a bit worse than in the episodic setting.
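For reference, my setup currently looks roughly like this (simplified sketch from memory; ContinuingEnv stands in for my real environment, and I show a finite horizon since soft_horizon only takes effect then, the exact value is arbitrary):

# Simplified version of my current setup (values are placeholders).
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"timesteps_total": 100000},
    config={
        "env": ContinuingEnv,   # always returns done=False in step()
        "horizon": 1000,        # finite horizon so that soft_horizon applies
        "soft_horizon": True,   # keep rolling, don't reset the env
        "gamma": 0.99,          # unchanged, no average-reward correction yet
        "num_workers": 1,
    },
)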
@stefanbschneider Thanks for your answer
Great question @stefanbschneider, I haven't thought about this myself (@ericl?). Yes, the PG loss function would even change slightly in this case (in other words, we don't support this atm). It shouldn't be too hard to fix, though. Which algo are you using (some PG, I assume)? For PPO, for example, you could probably change the postprocess_trajectory method to calculate the average (over n timesteps) instead of the sum of rewards, and also make sure that your value function learns to predict the average (not the discounted return until the end of the episode). Gamma should be 1.0 (no, not 0.0!).
@sven1977 or @ericl
Is there an example of how to use the postprocess_trajectory() method? The discounted reward may or may not work depending on how close gamma is to 1 and how long the continuing trajectory is. It would therefore be really appreciated to see the average reward and a value function based on it.
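For concreteness, I imagine the wiring would look something like this (untested sketch; the exact import paths and whether with_updates() is available depend on the RLlib version, and postprocess_average_reward is the hypothetical function sketched above):

# Untested sketch: plug a custom postprocess_fn into PPO via with_updates().
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

# Policy that computes average-reward (differential) returns in
# postprocessing instead of discounted returns.
AvgRewardPPOPolicy = PPOTFPolicy.with_updates(
    name="AvgRewardPPOPolicy",
    postprocess_fn=postprocess_average_reward,
)

# Trainer that uses the modified policy by default.
AvgRewardPPOTrainer = PPOTrainer.with_updates(
    name="AvgRewardPPO",
    default_policy=AvgRewardPPOPolicy,
)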