@bletham I'm seeing different forecast values in the uncertainty intervals on each run. As a result, different users are unable to replicate the same uncertainty interval forecast (even with the same variables, changepoint.prior.scale, and interval.width). Could you please help me understand the reason for this and how it can be tackled? Please let me know if the issue is unclear.
@ruksarsultana,
short answer: I think there is randomness in the optimization procedure, but the good news is that you can control for it, so that every user gets the same result, by setting the seed. Just put `set.seed(1234)` at the top of your code and execute the whole code if you are using R, or

```python
import random
random.seed(1234)
```

if you are using Python. You don't have to use 1234; you can use any integer there.
Hope this helps!
long answer:
I think that must be down to some randomness in the inferential procedure (Prophet uses Stan's penalized maximum likelihood estimation with optimization by default, and MCMC if you ask it to).
I have to read up a little on Stan's documentation for the penalized ML estimation method before I can narrow the answer down (it's been a while since I used any Bayesian analysis); I will update my answer accordingly. If you were using the full posterior MCMC, then it's clear that you would have randomness. EDIT: see my update below.
Now to the point of how to fix this: it is very easy. R (and Python) allow the random number generator to be initialized from a fixed, known state. That way you always get the same random numbers and results are reproducible. In fact, I would always recommend doing that in any kind of research work or analysis that involves random number generation. Just before you run the Prophet analysis, use `set.seed(1234)` in R, or set the seed in Python using `random.seed(1234)` from the `random` module, and run the whole code. Make sure that all the users do the same (they have to use the same seed).
Check out this topic on stackoverflow for a bit more info: https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function
EDIT/UPDATE: OK, here is what I think happens. When you run Prophet, by default you do inference using penalized ML, which in turn uses an L-BFGS optimization algorithm.
At some point in the L-BFGS optimization procedure, Prophet calls the `stan` function, and in `stan` one of the arguments you have is a choice of how to initialize the starting parameters for the optimization (the argument is called `init`). I presume that Prophet takes the default value, which is to initialize the starting parameters randomly, and this is where the randomness in your output comes from. You can take a look here for more details: https://mc-stan.org/docs/2_18/stan-users-guide/efficiency-for-probabilistic-models-and-algorithms.html
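To make the `init` behavior concrete, here is a hedged sketch using the pystan 2 API (the toy model below is illustrative only, not Prophet's actual model):

```python
import pystan

# Toy model for illustration; Prophet's Stan model is more involved.
model_code = """
data { int<lower=1> N; vector[N] y; }
parameters { real mu; real<lower=0> sigma; }
model { y ~ normal(mu, sigma); }
"""
sm = pystan.StanModel(model_code=model_code)
data = {'N': 3, 'y': [1.0, 2.0, 3.0]}

# Stan's default is init='random', one possible source of run-to-run
# variation in the MAP estimate.
map_random = sm.optimizing(data=data)

# A fixed init dict (and/or a fixed seed) pins the starting point down.
map_fixed = sm.optimizing(data=data, init={'mu': 0.0, 'sigma': 1.0}, seed=1234)
```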
Not sure if I am right; @bletham might confirm or correct me.
@APramov thanks for the tips! I'll add a few things.
It is correct that by default the fitting uses a MAP estimate. Stan by default uses random initialization, but Prophet actually does not - it uses a fixed initialization for the fitting:
https://github.com/facebook/prophet/blob/190d3239fd2172c9bfcedd57b7cdefde56108bf8/python/fbprophet/forecaster.py#L1092
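For reference, the fixed initialization behind that link is roughly of this shape (a paraphrased sketch with placeholder values, not the exact source; the linked code is authoritative):

```python
import numpy as np

# Paraphrased sketch of Prophet's fixed Stan initialization. The placeholder
# values stand in for quantities Prophet derives from the data (a linear fit
# to the history for the trend parameters, zeros elsewhere).
kinit = (0.5, 10.0)         # placeholder trend growth rate / offset
n_changepoints = 25         # placeholder; Prophet's default
n_seasonal_features = 6     # placeholder; depends on the seasonalities used

stan_init = {
    'k': kinit[0],                          # initial trend growth rate
    'm': kinit[1],                          # initial trend offset
    'delta': np.zeros(n_changepoints),      # changepoint rate adjustments
    'beta': np.zeros(n_seasonal_features),  # seasonality coefficients
    'sigma_obs': 1,                         # observation noise scale
}
```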
So the optimization is in principle deterministic. I say in principle because L-BFGS optimization is not actually totally deterministic, it depends on things like machine precision and the linear algebra packages installed on the system. My experience is that if I do the fitting on a single computer multiple times it will get the same result, but there are several issues on the github here that report examples where fitting the same time series on different machines has produced slightly different results due purely to differences in the L-BFGS optimization. These differences are always very small, but they can be present and unfortunately I do not think there is anything that can be done about them.
Though it sounds like the question here is mostly about uncertainty intervals, and for that there is a much larger source of variability across runs. Trend uncertainty is estimated using Monte Carlo sampling from the trend change generative model. This is by default done with 1000 samples, and then the (e.g.) 80% intervals are computed by taking the 10th and 90th quantiles of those samples. I would expect differences in the uncertainty intervals across runs to be primarily MC variance from this sampling. You can tell if this is the case by calling `predict` twice on the same fitted model; any difference in predictions is 100% due to the MC estimate of trend uncertainty. There are two things that can be done. When instantiating the `Prophet` object, there is an argument `uncertainty_samples` that specifies how many MC samples to do. Increasing this will reduce the MC variance, though at the cost of taking more time to run `predict` because it has to do more simulations. You might have to increase it very high to get things to be totally repeatable. The other approach would be to set a random seed, as @APramov suggests. In Python, all of the random sampling is done in numpy, so you'll have to set the seed in numpy:
```python
import numpy as np
np.random.seed(1234)
```
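Putting both remedies together, a minimal sketch (assuming a dataframe `df` with `ds` and `y` columns, as in the Prophet quickstart):

```python
import numpy as np
from fbprophet import Prophet

# More MC samples -> less MC variance in the intervals, slower predict.
m = Prophet(uncertainty_samples=4000)  # default is 1000
m.fit(df)
future = m.make_future_dataframe(periods=30)

# Seeding numpy immediately before each predict makes the MC sampling,
# and hence the uncertainty intervals, repeatable.
np.random.seed(1234)
fcst1 = m.predict(future)
np.random.seed(1234)
fcst2 = m.predict(future)

# Without the seeds, any difference between these two forecasts would be
# pure MC variance in the trend uncertainty.
assert np.allclose(fcst1['yhat_upper'], fcst2['yhat_upper'])
```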
> So the optimization is in principle deterministic. I say in principle because L-BFGS optimization is not actually totally deterministic, it depends on things like machine precision and the linear algebra packages installed on the system. My experience is that if I do the fitting on a single computer multiple times it will get the same result, but there are several issues on the github here that report examples where fitting the same time series on different machines has produced slightly different results due purely to differences in the L-BFGS optimization. These differences are always very small, but they can be present and unfortunately I do not think there is anything that can be done about them.
I've seen differences in the MAP estimation of up to 12% between different OSes that were causing functional tests to fail, so I dug into it, and here is where I got to.
The divergence happens in `fit._call_sampler(stan_args)` in pystan's model module, even though `stan_args` are the same in both cases. `_call_sampler` is defined in the Cython file, which is a template that pulls in the `$model_cppname.hpp` header file during the compilation process. Where it gets interesting is that conda's Python on Linux is compiled with GCC, while conda's Python on Mac is compiled with Clang (info below).
My best guess at this point is that having two different compilers generate the libraries is causing the L-BFGS optimization to take different routes. Any other thoughts, or did I miss anything here, @bletham?
__OS 1:__
Debian GNU/Linux 9 (stretch)
conda 4.7.10
python 3.7.4 [GCC 7.3.0] :: Anaconda, Inc. on linux
fbprophet 0.5
pystan 2.19.0.0
mkl 2019.4
mkl_random 1.1.0
__OS 2:__
Darwin Kernel Version 18.7.0
conda 4.7.12
python 3.7.4 [Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
fbprophet 0.5
pystan 2.19.0.0
mkl 2019.4
mkl_random 1.1.0
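For reference, the interpreter and compiler details above can be checked like this (standard library only):

```python
import platform
import sys

print(sys.version)                 # includes the compiler, e.g. "[GCC 7.3.0]"
print(platform.python_compiler())  # e.g. "GCC 7.3.0" or "Clang 4.0.1 ..."
```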
@sammo that is a very thorough analysis! And consistent with my expectation.
One thing I didn't note above is that this seems to be a particular issue on problems where the likelihood surface is very flat near the optimum, which can especially be the case in short time series where model identifiability can be poor. Stan's Newton optimizer does seem to be more robust in these situations, and in past issues where this has come up people seem to find more reproducible results using it (e.g. #253, which I think was the first report of this issue). You can set the optimizer like this:

```python
m.fit(df, algorithm='Newton')
```
Thanks for the response, @bletham. I'll try it out with Newton then.