TPOT: Is it possible to deal with imbalanced data?

Created on 20 Jan 2019 · 2 Comments · Source: EpistasisLab/tpot

I have a dataset with a binary label that falls into one of two cases:
case 1:
label = 1, which is the minority class and the important one
label = -1, which is the majority class and not important

case 2:
label = 1, which is the majority class and not important
label = -1, which is the minority class and the important one
(The ratio of label 1 to label -1 is normally 1:2 in case 1 and 2:1 in case 2.)

I divided the dataset into three parts: train/test for training, and validation as unseen data. Whether or not I use sampling to balance the train/test data, the model produced by TPOT does not give good results on my validation data.

On the unbalanced data, I tried the 'f1' and 'roc_auc' scoring metrics separately, and every pipeline produced predicts only the majority label.
On the balanced data, I tried the 'f1', 'roc_auc' and 'accuracy' scoring metrics separately, and the same pipeline gives clearly different, inconsistent validation results each time it is fitted.

I also passed StratifiedShuffleSplit with a test size of 0.33 as the cv parameter. The ratio of the two labels is the same in the validation data and in the imbalanced train/test data.
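
For readers unfamiliar with stratified splitting, here is a minimal pure-Python sketch of the idea: shuffle each class separately and cut the same fraction from each, so the held-out part keeps the original label ratio. This is not scikit-learn's StratifiedShuffleSplit implementation; the function name `stratified_split` and the 100/200 toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.33, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                         # shuffle within one class
        n_test = round(len(idx) * test_size)     # same fraction from each class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels in the 1:2 ratio of case 1 (label 1 is the minority).
labels = [1] * 100 + [-1] * 200
train_idx, test_idx = stratified_split(labels)

test_minority = sum(labels[i] == 1 for i in test_idx)
test_majority = len(test_idx) - test_minority
print(test_minority, test_majority)  # the held-out part keeps the 1:2 ratio
```

Stratification only keeps the ratio stable across splits. It does not by itself stop a pipeline from favouring the majority label.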

Are there any scoring metrics I can use to improve this?

ckliu1402

Most helpful comment

Maybe scoring='balanced_accuracy' can help with the problem. It is one of the built-in scoring metrics.

weixuanfu on 22 Jan 2019

👍2
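
To see why balanced accuracy may help here, the following sketch compares it with plain accuracy for a pipeline that always predicts the majority label. It assumes the usual definition of balanced accuracy as the mean of per-class recall; the helper functions and the 10/20 toy labels are written for this illustration and are not TPOT code.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Case 1 from the question: minority label 1, majority label -1, ratio 1:2.
y_true = [1] * 10 + [-1] * 20
# A degenerate pipeline that always predicts the majority label.
y_majority = [-1] * 30

acc = accuracy(y_true, y_majority)
bal = balanced_accuracy(y_true, y_majority)
print(round(acc, 3), round(bal, 3))  # 0.667 0.5
```

Plain accuracy rewards the majority-only predictor with about 0.667, while balanced accuracy gives it 0.5, the same as random guessing on two classes, so the search has less incentive to settle on it.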

Thanks for answering. What if I want to focus on the 'f1' or 'precision' of the minority class on the unseen data? Is it still meaningful to use the 'f1' score for the minority class on balanced training data, or 'balanced_accuracy' on unbalanced data?

ckliu1402 on 23 Jan 2019
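
One point that bears on this follow-up: F1 and precision are computed for a single "positive" label, and scikit-learn's `f1_score` treats label 1 as positive by default. In case 2 the minority label is -1, so a plain 'f1' score would measure the majority class. The sketch below shows the effect in pure Python; the helper `precision_recall_f1` and the toy labels are assumptions for illustration. With scikit-learn, a scorer built with `make_scorer(f1_score, pos_label=-1)` should express the same thing, and as far as I know it can be passed as TPOT's scoring argument, but that is not verified here.

```python
def precision_recall_f1(y_true, y_pred, pos_label):
    """Precision, recall and F1 with pos_label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Case 2 from the question: majority label 1, minority label -1, ratio 2:1.
y_true = [1] * 20 + [-1] * 10
# A degenerate pipeline that always predicts the majority label.
y_majority = [1] * 30

_, _, f1_pos = precision_recall_f1(y_true, y_majority, pos_label=1)
_, _, f1_neg = precision_recall_f1(y_true, y_majority, pos_label=-1)
print(round(f1_pos, 3), round(f1_neg, 3))  # 0.8 0.0
```

The majority-only predictor scores an F1 of 0.8 when label 1 is treated as positive but 0.0 when the minority label -1 is, so the choice of positive label decides whether the metric reflects the class that matters.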
