Stable-baselines: [question] Deterministic results with time-series-based observations

Created on 19 Apr 2020 · 4 Comments · Source: hill-a/stable-baselines

Hello all,
I had the chance to say it already but wouldn't miss a chance to say it again: thank you to the authors, maintainers and all contributors of this library, truly remarkable work.

As I keep learning thanks to this library, applied to a personal project of mine, I have run into a challenge that some of you might also have experienced. I didn't find any answer related to it while searching, hence this new issue.

Context
Inspired by RL Zoo examples, I'm training DDPG models on a custom gym environment of mine.
I have two separate environments, one for learning, one for evaluation.
Both environments are:

  • Based on a time series
  • From an observation standpoint, one step number = one specific hour in the timeframe
  • The observation is only for the current hour of the step (no "observation lookback")

The only difference between these environments is that:

  • The learning environment is from 2017 to Dec 31 2019 11 PM
  • The evaluation environment is from Jan 1 2020 12 AM (let's call this first hour of the time series: T0) to the very last complete hour at run time, so somewhere in April 2020 depending on when you read this (let's call this: T(n))

My action space is spaces.Discrete(3)

Steps already taken to have deterministic results

  • Using Stable Baselines 2.10.1a0
  • "n_cpu_tf_sess = 1" passed as parameter when creating a model
  • "deterministic=True" passed as parameter when using "model.predict()"
  • Using CPUs only ("USE_GPU=False" when building the RL Zoo Docker image this project of mine is running from) to avoid any TensorFlow GPU nondeterminism

If I load a pre-trained model and run it multiple times over the same evaluation environment going from T0 to T(n) ('n' being a fixed number of hours in that case, as I freeze the evaluation environment in time), I get the exact same results/same actions predicted at each step for all runs of the model, VERY GOOD 👍 (I think I achieved "deterministic" behavior at this stage)

Question
Now, as time passes, newly completed hours are appended to the evaluation environment and its time-series data. So, let's say I now have data from T0 to T(n+10), i.e. 10 more hours than before, and I ask the model for its actions for each hour over this whole new range, from T0 to T(n+10), which of course includes T(n):

The actions I get are now different for the last 20 hours or so. That is, even the actions at T(n), or slightly before T(n), at T(n-1) for instance, now differ from the actions suggested when the model was only seeing data up to T(n) 👎

**Again, my observations are each only for the current hour, but my batch size is 128. Does it influence anything? Is there any kind of sliding window the model is looking at that influences its predicted actions when seeing new data/new hours?

Can I somehow enforce these windows so that at least what was predicted at or before T(n) stays the same even if the model now has access to a time series beyond T(n)?**

(I hope I got my context & question(s) clear, I can draw an illustration if that makes it easier :) )

Wishing everyone a good start of this new upcoming week, stay safe and learn ML ! ;)
Thank you in advance to anyone who has experience to share here and discuss with me :)

custom gym env question

All 4 comments

Thanks for the kind words :).

Let me recap a little: You train the model on fixed data D, and then you check that the actions are always the same for all samples in D. You then gather a few extra samples (the +10 hours), append them to D, and _predict_ actions for this new data. Now the actions for a couple of the hours are different.

The key point above is the "_predict_": Did you train a new model on the new data, or only use the previously trained model? If you always used a fixed model without any training (with `predict`), then this should not be happening and we might have a bug somewhere. If you trained a new model with this new data, then this behaviour is somewhat expected, as with different data you will get different results. There is no "sliding" going on, and batch size should not matter (for prediction).
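
To narrow down which of these cases applies, one option is to log the observations and actions of both runs and compare them over the shared range. The snippet below is only a sketch under assumptions: the names `first_mismatch`, `old_obs`, `new_obs`, `old_actions` and `new_actions` are hypothetical, and the values are made up for illustration. If the observations diverge at or before the step where the actions diverge, the input data changed rather than the model.

```python
from typing import Optional, Sequence


def first_mismatch(old: Sequence, new: Sequence) -> Optional[int]:
    """Return the first index in the shared prefix where the two logs differ, else None."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return None


# Hypothetical logs, one entry per hourly step (illustrative values only).
old_obs = [(1.0, 0.5), (1.1, 0.4), (1.2, 0.3)]               # run over T0..T(n)
new_obs = [(1.0, 0.5), (1.1, 0.4), (1.25, 0.3), (1.3, 0.2)]  # run over the extended range
old_actions = [0, 2, 1]
new_actions = [0, 2, 0, 1]

obs_idx = first_mismatch(old_obs, new_obs)
act_idx = first_mismatch(old_actions, new_actions)

if obs_idx is not None and (act_idx is None or obs_idx <= act_idx):
    print(f"Observations already differ at step {obs_idx}: the input data changed.")
elif act_idx is not None:
    print(f"Same observations but different actions at step {act_idx}: look at the model.")
else:
    print("Shared range is identical.")
```

With a fixed model and `deterministic=True`, identical observations over the shared range should give identical actions, so the first branch would point at the data pipeline.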

Sidenote: All those steps for "full" determinism are for having consistent results between different training runs. If you only need the same actions for the same observations, then deterministic=True is generally enough.

Thanks @Miffyli for the prompt feedback!

You are right in your recap, except that, when gathering a few extra samples (the +10 hours), even the actions predicted for some data points in D also change.

Sorry for the confusion, of course I wouldn't expect any kind of reproducibility in the +10 hours. My question is more like: given results in each hour of D, and now I add 10 hours to it, can I get new results for the 10 hours WHILE keeping the same results in each hour of D?

This very last part "WHILE keeping the same results in each hour of D" is my challenge as of now :(

PS : the trained model is always the same, I have a ZIP of it and only use "load" and "predict" out of it

PS PS: yes, thank you for the sidenote. I actually want to achieve that so that my hyperparameter tuning attempts with Optuna are relevant (as in, if Optuna decided to try the same set of hyperparameters twice, it SHOULD lead to the same result for both trials)

> This very last part "WHILE keeping the same results in each hour of D" is my challenge as of now :(

This sounds more like a RL algorithm / design problem. One option would be to use different agents for different parts of data, if that is what your scenario allows. I do not think there is any general solution for that kind of problem. If the data you train on changes, or even just has new datapoints for different observations, it is not surprising all predictions change. Another thing you could try is to include past observations in your predictions, or use LSTM policies.
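
As a rough illustration of the "include past observations" idea, here is a minimal sketch of a window that only looks backwards in time. It is not part of stable-baselines; `PastWindow` and `stacked` are hypothetical names, and zero padding at the start of the series is an assumption. Because each stacked observation depends only on the current and earlier hours, appending new hours to the series does not change what the agent sees at earlier steps.

```python
from collections import deque
from typing import Deque, List, Sequence


class PastWindow:
    """Keeps the last `size` observations and returns them flattened, oldest first."""

    def __init__(self, size: int, obs_dim: int) -> None:
        # Assumption: pad with zeros until enough real observations have arrived.
        self._buf: Deque[List[float]] = deque(
            [[0.0] * obs_dim for _ in range(size)], maxlen=size
        )

    def push(self, obs: Sequence[float]) -> List[float]:
        # Appending to a full deque drops the oldest observation automatically.
        self._buf.append(list(obs))
        return [x for row in self._buf for x in row]


def stacked(series: Sequence[Sequence[float]], size: int) -> List[List[float]]:
    """Build the stacked observation for every step of a series."""
    window = PastWindow(size, len(series[0]))
    return [window.push(obs) for obs in series]


# Illustrative hourly series with one feature per hour, then the same series extended.
series = [[1.0], [2.0], [3.0], [4.0], [5.0]]
extended = series + [[6.0], [7.0]]

before = stacked(series, size=3)
after = stacked(extended, size=3)
print(before[2])                          # window at the third hour
print(before == after[: len(series)])     # earlier steps are unaffected by new hours
```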

These issues are not generally for tech support, and I suggest you ask for help on e.g. reddit /r/reinforcementlearning or stack overflow. You can close this issue if you do not have further issues/bugs to raise related specifically to stable-baselines.

Thank you again @Miffyli, I will implement some watchdogs to make sure the data remains the same when training and evaluating. I just realized that, since I'm sourcing part of the data from an external API, I shouldn't assume that past data remains the same on each pull (even though it should).
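
For anyone landing here with the same problem, a watchdog of this kind could look roughly like the sketch below. Everything in it is an assumption for illustration: the row layout, the `price` field, and the helper names `row_fingerprint` and `changed_hours` are hypothetical. The idea is to store a hash of each hour's row on the first pull and flag any hour whose row differs on a later pull.

```python
import hashlib
import json
from typing import Dict, List

Row = Dict[str, float]


def row_fingerprint(row: Row) -> str:
    """Hash a row in a key-order-independent way."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_hours(stored: Dict[str, str], fresh: Dict[str, Row]) -> List[str]:
    """Return timestamps already seen whose fresh row no longer matches the stored hash."""
    return sorted(
        ts for ts, row in fresh.items()
        if ts in stored and stored[ts] != row_fingerprint(row)
    )


# First pull from the (hypothetical) external API: remember a fingerprint per hour.
first_pull = {
    "2020-01-01T00:00": {"price": 10.0},
    "2020-01-01T01:00": {"price": 11.0},
}
stored = {ts: row_fingerprint(row) for ts, row in first_pull.items()}

# Later pull: one new hour, and one past hour silently revised by the source.
second_pull = {
    "2020-01-01T00:00": {"price": 10.0},
    "2020-01-01T01:00": {"price": 11.5},
    "2020-01-01T02:00": {"price": 12.0},
}

revised = changed_hours(stored, second_pull)
print(revised)  # hours whose historical data changed between pulls
```

New hours are ignored by the check, so only revisions to already-seen history are reported.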

I will also try your different suggestions

If that's OK with everyone, I will close the issue in a couple of days to give more people a chance to see the thread, in case there is more feedback or anything we have missed. Thanks again!!
