Tpot: Overfitting avoidance

Created on 18 Apr 2019 · 10 Comments · Source: EpistasisLab/tpot

Is there any "automatic" overfitting avoidance in TPOT? From what I gather using TPOT, it selects the best model based on the highest CV score. However, the training accuracy of the model is significantly higher than the test accuracy. For example, the suggested model gives the following KFold CV and training accuracy:

CV: 0.7389675227752329
Train: 0.9330253389536609

I know there are parameters in TPOT that help with overfitting such as subsample. However, since my understanding is that TPOT doesn't provide training scores, it's difficult to determine if there is overfitting or not. Does TPOT do early stopping? Or perhaps check the learning curve?

Thanks for any insight

question

All 10 comments

By default, TPOT uses CV scores to guard against overfitting. The default CV splitter is StratifiedKFold for classification and KFold for regression. In our experience, this default splitter is sometimes not ideal, so the 'cv' parameter may need to be specified explicitly in some cases.

TPOT can do early stopping via the early_stop option in the TPOT API. For checking the learning curve, you could try the warm_start option (related to #832).
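For reference, here is a minimal sketch of passing an explicit splitter and early stopping to TPOT. It assumes the classic TPOTClassifier constructor (cv, early_stop, generations, population_size, random_state, verbosity); the helper name fit_tpot and all the numbers are made up for illustration, and argument names may differ in newer TPOT releases.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# An explicit splitter instead of TPOT's default; the numbers are arbitrary.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

# Keyword arguments assumed to match the classic TPOTClassifier constructor.
tpot_params = {
    "generations": 20,
    "population_size": 20,
    "cv": cv,            # custom CV splitter
    "early_stop": 5,     # stop after 5 generations without improvement
    "random_state": 0,
    "verbosity": 2,
}


def fit_tpot(X, y, params=tpot_params):
    """Hypothetical helper: build and fit a TPOTClassifier with the settings above."""
    # Imported lazily so the configuration can be inspected without TPOT installed.
    from tpot import TPOTClassifier

    model = TPOTClassifier(**params)
    model.fit(X, y)
    return model
```

Calling fit_tpot(X_train, y_train) would then run the search on the training split only.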

Thank you for the help. I forgot I did have the early stop parameter set, but I am still getting overfit models. I think I am missing something and am sure I am doing something wrong. My understanding is that TPOT will select the best model which maximizes CV accuracy (at least in part). However, shouldn't it select the best model with respect to the training accuracy? That is, omit models where the training accuracy is significantly higher than the CV accuracy. Or perhaps have a setting that enables that? If all models overfit with higher training accuracy, then a warning message? Is this something that could be enabled or something that would be valuable? I am not sure how this is done with "autoML" or if I should be worried about this scenario. Thanks

Hi @jmrichardson thanks for the question!

Perfectly fit models can have very different training and testing accuracies. To assess the amount of overfitting, we need to look at the cross-validated accuracy. I'm not sure what you mean by this:

shouldn't it select the best model with respect to the training accuracy? That is, omit models where the training accuracy is significantly higher than the CV accuracy.

Except for being useful for detecting something that went horribly wrong (e.g. extremely low training accuracy compared to CV or testing accuracy), I think training accuracy is largely useless. For example, here's a nice stackexchange thread on the often almost-perfect performance of random forest:
https://stats.stackexchange.com/questions/162353/what-measure-of-training-error-to-report-for-random-forests

In short, to judge whether a model overfits, I would compare its cross-validated accuracy (instead of its standard training accuracy) with its testing accuracy.
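To make that concrete, here is a minimal sketch using plain scikit-learn on synthetic data (the dataset and all settings are assumptions for illustration, not taken from the original question). It reports training, cross-validated, and held-out test accuracy for a random forest, so you can see which two numbers are worth comparing.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data; sizes and seeds are arbitrary.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validated accuracy, computed on the training split only.
cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()

# Fit once on the full training split, then score on training and held-out data.
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"train={train_acc:.3f}  cv={cv_acc:.3f}  test={test_acc:.3f}")
```

A fully grown random forest typically scores close to 1.0 on its own training data, so the train/CV gap says little by itself; the CV/test gap is the one to watch.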

Let me know if this clears things up!

Hi, thank you so much for the reply, it really clears things up. After I posted, I read this article which echoes your explanation as well as the link you provided. I think my confusion was based on the fastai ML course I have been taking, which appears to suggest that a high training accuracy and lower CV accuracy indicates overfitting. Here are some snippets from the course notebook:

[0.0904611534175684, 0.2517003033389636, 0.9828975209204237, 0.8868601882297901]
An r^2 in the high-80's isn't bad at all (and the RMSLE puts us around rank 100 of 470 on the Kaggle leaderboard), but we can see from the validation set score that we're over-fitting badly.

Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both?

[0.2317315086850927, 0.26334275954117264, 0.89225792718146846, 0.87615150359885019, 0.88097587673696554] <-- Notice how the training and CV accuracy are similar now as opposed to the above

Thanks again for the help! I have been trying forever to figure out why so many hyperparameter tuning packages don't select the best model with respect to training accuracy (error) to avoid overfitting. Essentially, the only way is to compare CV with the hold-out test set.

Sigh... So much to learn :)

Ha! I love fastai but this first notebook is in fact quite confusing. Now, if they define overfitting as "high training accuracy and lower testing accuracy", that's perfectly fine; however, if so, overfitting is not necessarily bad and hence doesn't need reducing (say, compare model 1 that yields 99% training and 95% testing accuracy vs model 2 that yields 70% accuracy for both training and testing). Maybe you should open an issue/question there!

I'm happy to help! It was quite confusing to me as well when I first used random forest and got almost perfect training accuracy.

Yes, that makes complete sense. I would rather have a slightly "overfitted" model with 95% accuracy than a model that yields 70% accuracy on both training and test. I did ask the question on the fastai forums for clarification. Lol, I literally almost switched platforms because I assumed that sklearn and tpot were not robust enough to avoid overfitting in hyperparameter tuning. Thank goodness you came to the rescue!

So I just thought of something... If the best way to test for overfitting is to compare CV with test, it would be great if tpot would select the best model by also comparing to the hold-out test set? I am not exactly sure how this would work, but perhaps TPOT could make recommendations or actually select a model that doesn't appear to overfit. So, for example, given the following models:

model1: 73% CV accuracy, 48% Test accuracy
model2: 75% CV accuracy, 49% Test accuracy
model3: 69% CV accuracy, 65% Test accuracy

Model 3 may be a better choice?

@jmrichardson Ah so this would be "double-dipping", perhaps another concept you can look up. We can't allow TPOT to look at the test set until after a model is selected. Theoretically, the CV accuracy should be an approximation of the test accuracy, unless the independent test set is very different from training set, e.g. sampled under a different distribution.
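As a rough sketch of the workflow without double-dipping (plain scikit-learn on synthetic data; the three candidate models and all settings are assumptions for illustration): models are ranked by CV accuracy on the training split only, and the test set is touched once, after the choice is made.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; sizes and seeds are arbitrary.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Illustrative candidates; any estimators would do.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Step 1: rank models using cross-validation on the training split only.
cv_scores = {
    name: cross_val_score(est, X_train, y_train, cv=5).mean()
    for name, est in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Step 2: only after the choice is made, evaluate once on the held-out test set.
best_model = candidates[best_name].fit(X_train, y_train)
test_acc = best_model.score(X_test, y_test)

print(best_name, round(cv_scores[best_name], 3), round(test_acc, 3))
```

If the test score were fed back into step 1, it would stop being an unbiased estimate of performance on new data.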

Ha, one of the first things Jeremy from the fastai course said was that one of the most common problems for data scientists is that they "peek" in the test set to make a model perform better. I thought, no way would I do that! I can't believe I just did... It's so tempting though! I did read an article that mentioned that nested cross validation (more resources required) would be a better choice for tuning parameters to avoid bias. I am not sure if TPOT can do this or not. Anyways, thanks again!

I am not sure if TPOT can do this or not.

Fascinating idea! Theoretically, yes. Practically, I'm not sure if we want to implement this feature because of the exact reason you mentioned: [MULTIPLICATIVELY] more resources required.
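To give a sense of the cost, here is a minimal sketch of nested cross-validation in plain scikit-learn. It uses GridSearchCV as a stand-in for the inner tuning loop, not TPOT, and the data, grid, and fold counts are assumptions for illustration. The inner loop tunes hyperparameters, the outer loop scores the whole tuning procedure, and the number of model fits multiplies accordingly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data; sizes and seeds are arbitrary.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=inner_cv,
)

# Each outer fold reruns the full grid search on its own training portion.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)

# Fits per outer fold: (grid size x inner folds) plus one refit of the best setting.
grid_size = len(param_grid["max_depth"]) * len(param_grid["min_samples_leaf"])
n_fits = outer_cv.get_n_splits() * (grid_size * inner_cv.get_n_splits() + 1)

print(outer_scores.mean(), n_fits)
```

Replacing the small grid with a full TPOT run inside every outer fold would multiply the search time by the number of outer folds.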
