Prophet's paper ("Forecasting at Scale" by SJ Taylor, 2017) says the following on missing data:
"Unlike ARIMA models, the measurements do not need to be regularly spaced, and we do not need to interpolate missing values e.g. from removing outliers"
But if Prophet does not interpolate missing values, what does it do instead? How does it handle missing data?
The model is a regression model in continuous time. So we have some set of times t_1, ..., t_n at which we observed y_1, ..., y_n, and we try to estimate the function y = f(t). There's no requirement for the times t_1, ..., t_n to have any specific regularity. Because the model is continuous-time, we never have a value at every possible value of t anyway (there are infinitely many!), so there's no problem with having a day missing.
A good parallel would be a linear regression y on x, where a "missing value" would mean that we don't have data at some particular x. This is no problem for a regression model.
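To make the parallel concrete, here is a minimal sketch of an ordinary least-squares line fitted to irregularly spaced times. It is a toy illustration, not Prophet's actual model, and the helper `fit_line` and the data are made up for the example.

```python
# Toy illustration (not Prophet itself): least-squares fit of y = a + b*t
# on irregularly spaced times. Days 3, 4 and 7 are simply absent.
def fit_line(times, values):
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope

times = [0, 1, 2, 5, 6, 8, 9]            # no rows for days 3, 4, 7
values = [1.0 + 2.0 * t for t in times]  # noiseless y = 1 + 2t, for clarity

intercept, slope = fit_line(times, values)

# The fitted function can be evaluated at any t, including the missing days.
prediction_day_4 = intercept + slope * 4
```

Nothing in the fit ever asks whether the times are evenly spaced. In Prophet terms, this corresponds to simply leaving those rows out of the history dataframe (the one with `ds` and `y` columns).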
Hi, Ben, how about regressors? If I have 3 regressors, I have to make sure none is missing for both modeling and prediction, right?
@MiraLS that's right. With regressors you can still have irregularly spaced data (like a day missing for all of them), but for the days / times that you do have data, you must have all of them observed (so, y and all of the regressors). And then you need to know all regressors for any future points that you want to predict. If there are days where you have observations of y but are missing one or more regressors, you'd either have to remove that from your history (so you don't use that value of y), or you'd have to impute the regressors.
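As a rough sketch of the "remove that from your history" option, the snippet below keeps only the rows where y and every regressor are observed. The regressor names `r1`, `r2`, `r3`, the helper `is_complete`, and the data are hypothetical; plain Python dicts stand in for a dataframe.

```python
# Hypothetical history rows: a date, the target y, and three regressors.
# None marks a value that was not observed.
REGRESSORS = ["r1", "r2", "r3"]  # hypothetical regressor names

history = [
    {"ds": "2020-01-01", "y": 10.0, "r1": 1.0, "r2": 0.5, "r3": 3.0},
    {"ds": "2020-01-02", "y": 11.0, "r1": None, "r2": 0.6, "r3": 3.1},  # r1 missing
    # 2020-01-03 is absent entirely: that is fine, just an irregular gap.
    {"ds": "2020-01-04", "y": 12.5, "r1": 1.2, "r2": 0.7, "r3": 2.9},
]

def is_complete(row):
    """True if y and all regressors are observed for this row."""
    return row["y"] is not None and all(row[name] is not None for name in REGRESSORS)

# Drop rows with any missing regressor; the alternative is to impute them.
usable = [row for row in history if is_complete(row)]
usable_dates = [row["ds"] for row in usable]
```

With a pandas dataframe, the same filtering could be done with `df.dropna(subset=["y", "r1", "r2", "r3"])`. The same completeness requirement applies to the regressor values for any future dates you want to predict.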