Hi,
I am trying to compare the performance of multiple models, each fitted to a different timeseries.
There are too many cases to check each one manually, so the goal is to find a metric whose value reflects whether a model is good without looking at the plot.
The problem is that the timeseries have significantly different value ranges (for example, one is 0-100 while another is 0-1 million).
I trained each model and saved the mean of the performance metric (MAPE):
from prophet.diagnostics import cross_validation, performance_metrics

df_cv = cross_validation(model, initial='730 days', period='180 days', horizon='365 days')
df_pd = performance_metrics(df_cv)
mape_mean = df_pd.mape.mean()  # average MAPE over all horizon windows
But the value of MAPE is highly affected by the different value ranges of my timeseries.
I tried to implement SMAPE instead:
import statistics

forecast_vs_train = df_cv.merge(train_pd, left_on='ds', right_on='ds', how='inner')
predicted = forecast_vs_train.yhat
actual = forecast_vs_train.y
cutoffs = forecast_vs_train.cutoff.unique()
SMAPE_arr = []
for c in cutoffs:
    SMAPE = sum(abs(predicted - actual) / ((abs(actual) + abs(predicted)) / 2)) * 100 / actual.size
    SMAPE_arr.append(SMAPE)
SMAPE_cv = statistics.mean(SMAPE_arr)
My questions are: is it correct to average SMAPE over the cutoffs like this, and is there a better way to compute it?
Yes, averaging over cutoffs is correct. The way you can think about it is that each cutoff gives an estimate for the forecast error, and we're trying to estimate the average forecast error and so average over these. The built-in performance_metrics function averages over cutoffs.
One thing you're also doing here is averaging over values of the horizon for each individual forecasted point (ds - cutoff). So in this case, where the horizon is 365 days, you're estimating the average forecast error across that entire horizon window. That's a reasonable way to get a single number, and for the built-in metrics you can get this by doing
performance_metrics(df_cv, rolling_window=1.0)
(so you don't have to take the mean afterwards).
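For example, a single aggregated MAPE per model could be pulled out like this (a small sketch, assuming df_cv comes from the cross_validation call above):

from prophet.diagnostics import performance_metrics

# rolling_window=1.0 aggregates the metric over the full horizon,
# so the result should collapse to a single row
df_p = performance_metrics(df_cv, rolling_window=1.0)
mape_single = df_p['mape'].iloc[0]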
For computing SMAPE, it looks like the loop over cutoffs isn't doing anything, but it shouldn't be necessary anyway. You can compute the symmetric APE at each point in df_cv and then just take the mean of them all, like this:
import numpy as np

SMAPE = np.mean(
    np.abs(df_cv['yhat'] - df_cv['y']) / ((np.abs(df_cv['y']) + np.abs(df_cv['yhat'])) / 2)
)
Final point: you shouldn't need to do the inner join between df_cv and train_pd; df_cv already has the true y values in it (basically it has already done the join for you).
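For instance, a quick look at the cross-validation output should show the actuals sitting right next to the forecasts:

# df_cv already contains both the forecast and the actuals
print(df_cv[['ds', 'y', 'yhat', 'cutoff']].head())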
Thank you for the reply.
Finally, I decided to calculate the mean absolute scaled error (MASE). It was proposed by Hyndman & Koehler (2006) and is recommended as a good metric for comparing results across different models and timeseries with different scales: MASE = MAE_prophet / MAE_naive, where:
MAE_prophet is the mean absolute error of the Prophet forecast (the value from the built-in performance_metrics)
MAE_naive is the mean absolute error of the naive forecast (shifted timeseries -> the previous observation is used directly as the forecast)
The lower the MASE value, the better the Prophet fit.
In general, comparing MASE values against the plotted results, the metric seems to reflect well whether a model is good without looking at the plot.
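For reference, MASE can be computed from the earlier snippets roughly like this (a minimal sketch, assuming train_pd is the training dataframe with columns ds and y, and df_cv is the cross_validation output):

import numpy as np
from prophet.diagnostics import performance_metrics

# MAE of the Prophet forecast, aggregated over the whole horizon
mae_prophet = performance_metrics(df_cv, rolling_window=1.0)['mae'].iloc[0]

# MAE of the naive forecast: each observation is predicted by the previous one
y = train_pd['y'].to_numpy()
mae_naive = np.mean(np.abs(y[1:] - y[:-1]))

MASE = mae_prophet / mae_naive  # lower is better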
That seems like a great metric!