The `simulated_historical_forecasts` function currently doesn't account for the fact that, when an indicator variable is split across different cutoffs, you may hit a training window in which all of its values are zero (or all are one).
This throws an error in `initialize_scales_fn`, because the uniqueness check on the regressor, given below, fails:
```
for (name in names(m$extra_regressors)) {
  n.vals <- length(unique(df[[name]]))
  if (n.vals < 2) {
    stop('Regressor ', name, ' is constant.')
  }
}
```
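To make the failure mode concrete, here is a small self-contained illustration (the `promo` column and the dates are made up):

```
# A binary 'promo' regressor that only turns on late in the history.
df <- data.frame(
  ds = seq(as.Date('2017-01-01'), by = 'day', length.out = 365),
  y = rnorm(365)
)
df$promo <- as.numeric(df$ds >= as.Date('2017-07-01'))

# Any cutoff before the promotion starts leaves a training window in
# which 'promo' is identically zero, so the uniqueness check above
# would stop with "Regressor promo is constant."
history.c <- subset(df, ds <= as.Date('2017-06-01'))
length(unique(history.c$promo))  # 1
```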
I handle this by making the following change in the function:

```
regressor_names <- names(model$extra_regressors)
# Check that the regressors we added are not entirely constant
if (!is.null(regressor_names)) {
  # Number of unique values for each regressor in history.c
  num_unique_by_regressor <- sapply(
    regressor_names, function(x) length(unique(history.c[[x]]))
  )
  # Which regressors should we remove
  regressors_to_remove <- names(which(num_unique_by_regressor < 2))
  if (length(regressors_to_remove) > 0) {
    # Remove the regressors from the model copy
    for (name in regressors_to_remove) {
      m$extra_regressors[[name]] <- NULL
    }
    # Remove the names attribute for consistency
    if (!is.null(attr(m$extra_regressors, which = 'names'))) {
      attr(m$extra_regressors, which = 'names') <- NULL
    }
    # Remove the regressors from history.c
    history.c <- dplyr::select(history.c, -dplyr::one_of(regressors_to_remove))
  }
}
```
The entire function then becomes:

```
simulated_historical_forecasts <- function(model, horizon, units, k,
                                           period = NULL) {
  df <- model$history
  horizon <- as.difftime(horizon, units = units)
  if (is.null(period)) {
    period <- horizon / 2
  } else {
    period <- as.difftime(period, units = units)
  }
  # Regressor names
  regressor_names <- names(model$extra_regressors)
  cutoffs <- generate_cutoffs(df, horizon, k, period)
  predicts <- data.frame()
  for (i in seq_along(cutoffs)) {
    cutoff <- cutoffs[i]
    # Copy the model
    m <- prophet_copy(model, cutoff)
    # Train model
    history.c <- dplyr::filter(df, ds <= cutoff)
    # Check that the regressors we added are not entirely constant
    if (!is.null(regressor_names)) {
      # Number of unique values for each regressor in history.c
      num_unique_by_regressor <- sapply(
        regressor_names, function(x) length(unique(history.c[[x]]))
      )
      # Which regressors should we remove
      regressors_to_remove <- names(which(num_unique_by_regressor < 2))
      if (length(regressors_to_remove) > 0) {
        # Remove the regressors from the model copy
        for (name in regressors_to_remove) {
          m$extra_regressors[[name]] <- NULL
        }
        # Remove the names attribute for consistency
        if (!is.null(attr(m$extra_regressors, which = 'names'))) {
          attr(m$extra_regressors, which = 'names') <- NULL
        }
        # Remove the regressors from history.c
        history.c <- dplyr::select(
          history.c, -dplyr::one_of(regressors_to_remove)
        )
      }
    }
    # Fit model
    m <- fit.prophet(m, history.c)
    # Calculate yhat
    df.predict <- dplyr::filter(df, ds > cutoff, ds <= cutoff + horizon)
    columns <- c('ds')
    if (m$growth == 'logistic') {
      columns <- c(columns, 'cap')
      if (m$logistic.floor) {
        columns <- c(columns, 'floor')
      }
    }
    # Use the possibly pruned regressor set from the model copy
    columns <- c(columns, names(m$extra_regressors))
    future <- df.predict[columns]
    yhat <- stats::predict(m, future)
    # Merge yhat, y, and cutoff.
    df.c <- dplyr::inner_join(df.predict, yhat, by = "ds")
    df.c <- dplyr::select(df.c, ds, y, yhat, yhat_lower, yhat_upper)
    df.c$cutoff <- cutoff
    predicts <- rbind(predicts, df.c)
  }
  return(predicts)
}
```
This is a challenging issue. There are certainly ways to get around this, and the solution you post is one, but it isn't clear to me what the right thing to do is to make the cross-validation meaningful. Our goal is to estimate model generalization. If the external regressor is important, then removing it means we're now fitting a different model, whose performance is probably not indicative of the generalization performance of the full model.
It seems to me the more reasonable thing to do would be to not attempt cross-validation on segments of the history that don't contain all of the data the model needs (like both levels of an indicator variable). Since the cross-validation uses histories of increasing length, we should really just start the cross-validation at a point in the history that has everything we need. This might mean fewer samples to estimate performance, but as I said above, otherwise we are getting more samples of something that isn't really the generalization we want to estimate.
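A rough sketch of that approach (`prune_cutoffs` is a hypothetical helper, not part of Prophet):

```
# Hypothetical helper: keep only cutoffs whose training window already
# contains at least two distinct values of every extra regressor, so
# each fold fits the full model rather than a reduced one.
prune_cutoffs <- function(df, cutoffs, regressor_names) {
  ok <- sapply(cutoffs, function(cutoff) {
    history.c <- df[df$ds <= cutoff, , drop = FALSE]
    all(sapply(regressor_names,
               function(x) length(unique(history.c[[x]])) >= 2))
  })
  cutoffs[ok]
}
```

Applied right after generate_cutoffs, this would reduce the number of folds when the early history is missing a regressor level, instead of silently changing the model being evaluated.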
Any thoughts on this are appreciated.
Hi Ben,
Thanks for your response. I think the points you raise are fair and make more sense from a statistical perspective. In retrospect, my suggestion seems more like a hack to get around the model fitting error.
I think it would indeed be better to simply ignore such a segment of history where we don't have all the data needed.
You can close the issue if you want :)
Rohail
At the least we should provide some better messaging about this, and probably something in the documentation, so I'll leave this open. Also, if anyone else is trying this and has thoughts on what would be useful, please chime in :-)
For example, you have one year of data for a product on promotion, followed by another year without promotion.
Let's assume we are not planning any promotions for the next 26 weeks. We should still be able to cross-validate it and/or forecast with zero as the regressor value. This is really needed functionality for me to get a baseline.
Does the model normalize the external regressors to between 0 and 1 and then use 0.5 as the middle value? (I know this is not precise.)
@deniznoah That seems like a reasonable use case. We'll then need to have a way to drop constant extra regressors in fitting.
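One possible shape for that (a sketch only; `set_regressor_scales` is a hypothetical helper, and this is not necessarily what the eventual fix does) is to skip standardization for constant regressors rather than stopping:

```
# Hypothetical helper (not Prophet's actual code): set per-regressor
# scaling parameters, skipping standardization when a regressor is
# constant instead of raising an error. A constant column then passes
# through to the model unchanged (mu = 0, std = 1).
set_regressor_scales <- function(m, df) {
  for (name in names(m$extra_regressors)) {
    standardize <- m$extra_regressors[[name]]$standardize
    n.vals <- length(unique(df[[name]]))
    if (n.vals < 2) {
      standardize <- FALSE  # constant regressor: leave as-is
    }
    if (identical(standardize, 'auto')) {
      # 'auto' standardizes everything except binary 0/1 regressors
      standardize <- !setequal(unique(df[[name]]), c(0, 1))
    }
    if (isTRUE(standardize)) {
      m$extra_regressors[[name]]$mu <- mean(df[[name]])
      m$extra_regressors[[name]]$std <- stats::sd(df[[name]])
    } else {
      m$extra_regressors[[name]]$mu <- 0
      m$extra_regressors[[name]]$std <- 1
    }
  }
  m
}
```

The key point is that dividing by a zero standard deviation is the real hazard, so a constant regressor can simply be passed through on its original scale.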
As for whether or not they are normalized: non-binary extra regressors are standardized (subtract the mean, divide by the standard deviation) so they have mean 0 and standard deviation 1. Binary extra regressors are left as-is. This behavior can be overridden when adding them; see help(Prophet.add_regressor).
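For reference, the override he mentions is the standardize argument of add_regressor; a minimal example (the column names are made up):

```
library(prophet)

m <- prophet()
# Default 'auto': standardized, since temperature is not binary 0/1.
m <- add_regressor(m, 'temperature')
# Force a regressor to stay on its original scale:
m <- add_regressor(m, 'promo', standardize = FALSE)
```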
Does the tool check for constant regressors only during fitting? I thought it also checked for constant regressors in the future dataframe.
Deniz
https://github.com/facebook/prophet/commit/107f74f0f2e0f56e9d543a8bbc6889a1376b7102 allows constant regressors generally, and so will fix this issue specifically.
Pushed out with v0.3.