The `simulated_historical_forecasts` function currently doesn't account for the fact that, when an indicator variable is split across different cutoffs, you may hit a training window in which all of its values are zero (or all are one).
This throws an error in `initialize_scales_fn`, because the uniqueness check on the regressor, given below, fails:
```
for (name in names(m$extra_regressors)) {
  n.vals <- length(unique(df[[name]]))
  if (n.vals < 2) {
    stop('Regressor ', name, ' is constant.')
  }
}
```
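To make the failure mode concrete, here is a small self-contained illustration (the `promo` column and the dates are made up):

```
# A binary 'promo' regressor that only turns on late in the history.
df <- data.frame(
  ds = seq(as.Date('2017-01-01'), by = 'day', length.out = 365),
  y = rnorm(365)
)
df$promo <- as.numeric(df$ds >= as.Date('2017-07-01'))

# Any cutoff before the promotion starts leaves a training window in
# which 'promo' is identically zero, so the uniqueness check above
# would stop with "Regressor promo is constant."
history.c <- subset(df, ds <= as.Date('2017-06-01'))
length(unique(history.c$promo))  # 1
```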
I handle this by making the following change in the function:

```
regressor_names <- names(model$extra_regressors)
# Check that the regressors we added are not entirely constant
if (!is.null(regressor_names)) {
  # Number of unique values for each regressor in history.c
  num_unique_by_regressor <- sapply(
    regressor_names, function(x) length(unique(history.c[[x]]))
  )
  # Which regressors should we remove
  regressors_to_remove <- names(which(num_unique_by_regressor < 2))
  if (length(regressors_to_remove) > 0) {
    # Remove the regressors from the model copy
    for (name in regressors_to_remove) {
      m$extra_regressors[[name]] <- NULL
    }
    # Remove the names attribute for consistency
    if (!is.null(attr(m$extra_regressors, which = 'names'))) {
      attr(m$extra_regressors, which = 'names') <- NULL
    }
    # Remove the regressors from history.c
    history.c <- dplyr::select(history.c, -dplyr::one_of(regressors_to_remove))
  }
}
```
The entire function then becomes:

```
simulated_historical_forecasts <- function(model, horizon, units, k,
                                           period = NULL) {
  df <- model$history
  horizon <- as.difftime(horizon, units = units)
  if (is.null(period)) {
    period <- horizon / 2
  } else {
    period <- as.difftime(period, units = units)
  }
  # Regressor names
  regressor_names <- names(model$extra_regressors)
  cutoffs <- generate_cutoffs(df, horizon, k, period)
  predicts <- data.frame()
  for (i in seq_along(cutoffs)) {
    cutoff <- cutoffs[i]
    # Copy the model
    m <- prophet_copy(model, cutoff)
    # Train model
    history.c <- dplyr::filter(df, ds <= cutoff)
    # Check that the regressors we added are not entirely constant
    if (!is.null(regressor_names)) {
      # Number of unique values for each regressor in history.c
      num_unique_by_regressor <- sapply(
        regressor_names, function(x) length(unique(history.c[[x]]))
      )
      # Which regressors should we remove
      regressors_to_remove <- names(which(num_unique_by_regressor < 2))
      if (length(regressors_to_remove) > 0) {
        # Remove the regressors from the model copy
        for (name in regressors_to_remove) {
          m$extra_regressors[[name]] <- NULL
        }
        # Remove the names attribute for consistency
        if (!is.null(attr(m$extra_regressors, which = 'names'))) {
          attr(m$extra_regressors, which = 'names') <- NULL
        }
        # Remove the regressors from history.c
        history.c <- dplyr::select(
          history.c, -dplyr::one_of(regressors_to_remove)
        )
      }
    }
    # Fit model
    m <- fit.prophet(m, history.c)
    # Calculate yhat
    df.predict <- dplyr::filter(df, ds > cutoff, ds <= cutoff + horizon)
    columns <- c('ds')
    if (m$growth == 'logistic') {
      columns <- c(columns, 'cap')
      if (m$logistic.floor) {
        columns <- c(columns, 'floor')
      }
    }
    # Use the possibly pruned regressor set from the model copy
    columns <- c(columns, names(m$extra_regressors))
    future <- df.predict[columns]
    yhat <- stats::predict(m, future)
    # Merge yhat, y, and cutoff.
    df.c <- dplyr::inner_join(df.predict, yhat, by = "ds")
    df.c <- dplyr::select(df.c, ds, y, yhat, yhat_lower, yhat_upper)
    df.c$cutoff <- cutoff
    predicts <- rbind(predicts, df.c)
  }
  return(predicts)
}
```
This is a challenging issue. There are certainly ways to get around this, and the solution you post is one, but it isn't clear to me what the right thing to do is to make the cross-validation meaningful. Our goal is to estimate model generalization. If the external regressor is important, then removing it means we're now fitting a different model, whose performance is probably not indicative of the generalization performance of the full model.
It seems to me the more reasonable thing to do would be to not attempt cross-validation on segments of the history that don't contain all of the data the model needs (like both levels of an indicator variable). Since the cross-validation uses histories of increasing length, we should really just start the cross-validation at a point in the history that has everything we need. This might mean fewer samples to estimate performance, but as I said above, otherwise we are getting more samples of something that isn't really the generalization we want to estimate.
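A rough sketch of that approach (`prune_cutoffs` is a hypothetical helper, not part of Prophet):

```
# Hypothetical helper: keep only cutoffs whose training window already
# contains at least two distinct values of every extra regressor, so
# each fold fits the full model rather than a reduced one.
prune_cutoffs <- function(df, cutoffs, regressor_names) {
  ok <- sapply(cutoffs, function(cutoff) {
    history.c <- df[df$ds <= cutoff, , drop = FALSE]
    all(sapply(regressor_names,
               function(x) length(unique(history.c[[x]])) >= 2))
  })
  cutoffs[ok]
}
```

Applied right after generate_cutoffs, this would reduce the number of folds when the early history is missing a regressor level, instead of silently changing the model being evaluated.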
Any thoughts on this are appreciated.
Hi Ben,
Thanks for your response. I think the points you raise are fair and make more sense from a statistical perspective. In retrospect, my suggestion seems more like a hack to get around the model fitting error.
I think it would indeed be better to simply ignore such a segment of history where we don't have all the data needed.
You can close the issue if you want :)
Rohail
At the least we should provide some better messaging about this, and probably something in the documentation, so I'll leave this open. Also, if anyone else is trying this and has thoughts on what would be useful, please chime in :-)
For example, you have one year of data for a product on promotion, followed by another year without promotion.
Let's assume we are not planning any promotions for the next 26 weeks. We should still be able to cross-validate it and/or forecast with zero as the regressor value. This is really needed functionality for me to get a baseline.
Does the model normalize the external regressors to between 0 and 1 and then use 0.5 as the middle value? (I know this is not precise.)
@deniznoah That seems like a reasonable use case. We'll then need to have a way to drop constant extra regressors in fitting.
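One possible shape for that (a sketch only; `set_regressor_scales` is a hypothetical helper, and this is not necessarily what the eventual fix does) is to skip standardization for constant regressors rather than stopping:

```
# Hypothetical helper (not Prophet's actual code): set per-regressor
# scaling parameters, skipping standardization when a regressor is
# constant instead of raising an error. A constant column then passes
# through to the model unchanged (mu = 0, std = 1).
set_regressor_scales <- function(m, df) {
  for (name in names(m$extra_regressors)) {
    standardize <- m$extra_regressors[[name]]$standardize
    n.vals <- length(unique(df[[name]]))
    if (n.vals < 2) {
      standardize <- FALSE  # constant regressor: leave as-is
    }
    if (identical(standardize, 'auto')) {
      # 'auto' standardizes everything except binary 0/1 regressors
      standardize <- !setequal(unique(df[[name]]), c(0, 1))
    }
    if (isTRUE(standardize)) {
      m$extra_regressors[[name]]$mu <- mean(df[[name]])
      m$extra_regressors[[name]]$std <- stats::sd(df[[name]])
    } else {
      m$extra_regressors[[name]]$mu <- 0
      m$extra_regressors[[name]]$std <- 1
    }
  }
  m
}
```

The key point is that dividing by a zero standard deviation is the real hazard, so a constant regressor can simply be passed through on its original scale.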
As for whether or not they are normalized: non-binary extra regressors are standardized (subtract the mean, divide by the standard deviation) so they have mean 0 and standard deviation 1. Binary extra regressors are left as-is. This behavior can be overridden when adding them; see help(Prophet.add_regressor).
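For reference, the override he mentions is the standardize argument of add_regressor; a minimal example (the column names are made up):

```
library(prophet)

m <- prophet()
# Default 'auto': standardized, since temperature is not binary 0/1.
m <- add_regressor(m, 'temperature')
# Force a regressor to stay on its original scale:
m <- add_regressor(m, 'promo', standardize = FALSE)
```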
Does the tool check for constant regressors only during fitting? I thought it also checked for constant regressors in the future dataframe.
Deniz
https://github.com/facebook/prophet/commit/107f74f0f2e0f56e9d543a8bbc6889a1376b7102 allows constant regressors generally, and so will fix this issue specifically.
Pushed out with v0.3.