TPOT: Good training/testing performance, poor exported pipeline performance

Created on 15 Jan 2020  ·  7 Comments  ·  Source: EpistasisLab/tpot


This isn’t really an issue with TPOT, but I wanted to post here before using Stack Overflow, as it does pertain to TPOT.

Before I post any code, I’d like to just give some context to check if anyone thinks I’m doing something wrong/get ideas on things to try before posting code. It might be a case of overfitting, but I’m not sure.


I have a multivariate, pre-shuffled, balanced dataset of about 27,000 rows that I’m using to train a binary classification model (i.e. the class column has an equal number of 1’s and 0’s). I initially started with a 50/50 train/test split, and it achieved around 60-65% accuracy on the test data (I can’t recall exactly as I’m away from my computer), which is great for my dataset. Knowing that, I let TPOT train on the entire training dataset using 10-fold cross-validation with “accuracy” as the performance metric. I then tried the exported pipeline on a new, unseen dataset with the exact same structure, and it performs horribly: it predicts a lot of 1’s that should really be 0’s.

One thing I’m unclear on is HOW to actually use the exported pipeline on a new dataset. Right now, I call .fit() to re-fit the exported pipeline on the same training dataset that was used during the training process, and then call .predict() on the new dataset. Is this correct? My reasoning is that if the model uses a pre-processing step, say MinMaxScaler, it shouldn’t be fitted on the new dataset, because the min/max values there could differ from those used during training, so fitting it to the training dataset first should take care of that step?

Some things I’ve thought of/saw online to try to rectify this might be to:

  1. Increase the number of folds as my dataset seems quite large
  2. Use a smaller train/test split. Maybe go back down to 50/50 and then try using THAT exported pipeline on the new dataset
  3. Use “balanced_accuracy” instead of “accuracy”
  4. Am I using the exported pipeline wrong?

What else can I try/what am I missing to see if I can get an exported pipeline to perform better on the new data? It seems like it overuses 1 predictions when they should be 0’s. Any ideas on what to try? Then I can try them and if I’m still stuck I can post my code. Thanks a lot!

question

All 7 comments

It seems TPOT's optimized pipeline was overfitted to your whole training set.

I think you should try potential solutions 2 and 3 first, or use the subsample option (discussed in #804), to address the overfitting issue.

I think 1. may help a little, but not very much, since you already used 10-fold CV.
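To illustrate why switching the metric (solution 3) can matter when a model over-predicts one class, here is a small sketch with made-up labels, not the poster's data. It computes plain accuracy and balanced accuracy (the mean of per-class recall) by hand with NumPy; the helper name balanced_accuracy is only for illustration.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls (illustrative helper, not a TPOT function)."""
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Made-up labels: 8 ones and 2 zeros, with a model that predicts 1 every time.
y_true = np.array([1] * 8 + [0] * 2)
y_pred = np.ones_like(y_true)

accuracy = float(np.mean(y_true == y_pred))
balanced = balanced_accuracy(y_true, y_pred)

print(accuracy)  # 0.8, looks fine although class 0 is never predicted
print(balanced)  # 0.5, recall is 1.0 for class 1 and 0.0 for class 0
```

Plain accuracy rewards the always-1 predictor here, while balanced accuracy scores it no better than chance.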

About 4., there are two ways to use the exported pipeline:

  1. As mentioned in this issue, you can refit the pipeline on the same training dataset and then use .predict() on the new unseen dataset. I think those pre-processing steps should be fitted only on the training dataset to avoid data leakage.
  2. After finishing the tpot.fit() step, dump tpot.fitted_pipeline_ to a pickle file (please see this link for more details); you can then use this fitted pipeline in another Python kernel without refitting. Or, if you want to test the performance on the new unseen dataset within the same Python kernel, you can simply use tpot.fitted_pipeline_ on the dataset after reading it.
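The two options above can be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, and a hand-built scikit-learn pipeline (MinMaxScaler plus LogisticRegression) stands in for whatever pipeline TPOT actually exported or stored in tpot.fitted_pipeline_.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training data and the new, unseen data.
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_new = rng.normal(loc=0.5, scale=2.0, size=(50, 3))

# Stand-in for an exported pipeline (assumption: yours will have different steps).
exported_pipeline = make_pipeline(MinMaxScaler(), LogisticRegression())

# Option 1: refit on the training data only, then predict on the new data.
exported_pipeline.fit(X_train, y_train)
predictions = exported_pipeline.predict(X_new)

# The scaler keeps the training min/max; it is not re-estimated on X_new.
scaler = exported_pipeline.named_steps["minmaxscaler"]
assert np.allclose(scaler.data_min_, X_train.min(axis=0))

# Option 2: pickle the fitted pipeline and reload it without refitting.
# (In practice the bytes would be written to a file and read in another kernel.)
restored = pickle.loads(pickle.dumps(exported_pipeline))
assert np.array_equal(restored.predict(X_new), predictions)
```

Either way, .fit() only ever sees the training data, so the scaling parameters applied to the new dataset are the ones learned during training.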

Thanks @weixuanfu . This gives me enough to try for now. I'll close the issue now, try out your solutions, and see how I make out. If for whatever reason I'm still having issues, I'll upload a reproducible script and datasets for further testing. Thanks again!

@weixuanfu You're right, I lowered my train/test split down to 50/50 and changed to balanced accuracy, and the model performed way better on out-of-sample data. Thanks a lot for the recommendation.

I did have a new question for you, and this one does not really pertain to this issue, but I wanted to check here first before opening another issue. I recently discovered the possibility of class sensitive training for a classification model/adding sample weights during the .fit() function, but I'm a little unclear on how to go about using the feature properly. I've read through a few closed issues on the topic and tried to make sense of it, so I'll try and just describe what I'm after and maybe you can steer me in the right direction again.

I have the same dataset as described before (balanced, multivariate, binary classification), and what I'd like to do is add a weight to the 1's in my dataset, meaning that when the model predicts a 1 and the actual value was a 0, that prediction should be penalized somehow, whereas if it predicted a 0 and it was actually supposed to be a 1, that shouldn't be treated as being as bad a miss. I know this is something used more with an imbalanced dataset, but I feel it could be an important feature for my particular use case.

So with that in mind, is that what sample_weight is used for? And if so, what's the proper syntax to add such a thing? I read the function listed in 708, but I'm just a little unclear on how to make it work. Thanks a million! Love TPOT, works amazing!

Yes, sample_weight is used for setting different weights on samples. In your case, I think you need to assign a bigger weight to the samples whose actual value is 0, so that predicting 1 for them carries a larger penalty on the accuracy score.

You can pass it to the tpot.fit() function via the sample_weight parameter. We documented it in the TPOT API.

sample_weight: array-like {n_samples}, optional

    Per-sample weights. Higher weights indicate more importance. If specified, sample_weight will be passed to any pipeline element whose fit() function accepts a sample_weight argument. By default, using sample_weight does not affect tpot's scoring functions, which determine preferences between pipelines. 

@weixuanfu Right I did see that, so if I wanted to assign say a weight of 2 to the 0’s in my dataset, would the parameter look something like sample_weight = {‘0’ : 2} ? Thanks a million again!

sample_weight should be array-like with the same shape as y. Something like:

import numpy as np

sample_weight = np.ones(y.shape, dtype=int)
sample_weight[y == 0] = 2  # samples whose actual class is 0 get weight 2
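As a rough illustration of what such weights do, the sketch below builds the array for a tiny made-up label vector and compares a weighted accuracy for two sets of predictions, one with a single false positive and one with a single false negative. The weighted accuracy is computed by hand here only to show the effect; per the documentation quoted above, passing sample_weight to tpot.fit() does not by default change TPOT's own scoring.

```python
import numpy as np

# Made-up labels for illustration: three 0's and three 1's.
y_true = np.array([0, 0, 0, 1, 1, 1])

# Weight 2 for samples whose actual class is 0, weight 1 otherwise.
sample_weight = np.ones(y_true.shape, dtype=int)
sample_weight[y_true == 0] = 2

# One false positive (a true 0 predicted as 1) vs. one false negative.
y_pred_fp = np.array([1, 0, 0, 1, 1, 1])
y_pred_fn = np.array([0, 0, 0, 0, 1, 1])


def weighted_accuracy(y_true, y_pred, weights):
    """Share of the total weight that sits on correctly predicted samples."""
    correct = (y_true == y_pred).astype(float)
    return float(np.average(correct, weights=weights))


acc_fp = weighted_accuracy(y_true, y_pred_fp, sample_weight)
acc_fn = weighted_accuracy(y_true, y_pred_fn, sample_weight)

print(acc_fp)  # 7/9: the false positive costs 2 of the 9 weight units
print(acc_fn)  # 8/9: the false negative costs only 1

# With a TPOT estimator, the array would be passed as documented above:
# tpot.fit(X_train, y_train, sample_weight=sample_weight)
```

Both predictions have the same unweighted accuracy (5/6), but the false positive scores lower once the 0's carry twice the weight.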

Oh the shape of Y, yes yes. Perfect thanks so much!!!!
