Tpot: What is the best model of Tpot?

Created on 26 Feb 2020 · 6 Comments · Source: EpistasisLab/tpot

Option 1: the average of the models fitted on each cross-validation fold.
Option 2: a single model fitted on all the data.
Option 3: the best single model fitted on one cross-validation fold.

I want to manually control some model parameters (such as early stopping), and I can't restructure the current model.

Thank you very much.

question

htluandc2

All 6 comments

In general, TPOT uses option 1 to evaluate models/pipelines: each candidate is scored by its average cross-validation score. The best pipeline found by that evaluation should then be fitted on all the training data for scoring and prediction.
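
The two steps described above can be sketched with plain scikit-learn. This is an illustration only, not TPOT's internal code: the synthetic data and the `GradientBoostingClassifier` stand-in are assumptions made so the example runs without TPOT or xgboost installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic data standing in for the poster's X_train / y_train / X_test.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Stand-in estimator (assumption; the thread's best model was an XGBClassifier).
model = GradientBoostingClassifier(n_estimators=50, random_state=2)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)

# Step 1: evaluation. The score of a candidate is the mean of its per-fold scores.
fold_scores = cross_val_score(model, X_train, y_train, cv=skfold, scoring="roc_auc")
mean_cv_score = fold_scores.mean()

# Step 2: prediction. One model is refitted on all the training data.
model.fit(X_train, y_train)
test_proba = model.predict_proba(X_test)[:, 1]

print("mean CV AUC:", mean_cv_score)
```

Note that the fold models from step 1 are only used for scoring here; the predictions come from the single model fitted in step 2.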

weixuanfu on 26 Feb 2020

🎉2

I don't think that's correct.

This is my model from TPOT:

from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)
pipeline_optimizer = TPOTClassifier(generations=10,
                                    n_jobs=4,
                                    cv=skfold,
                                    max_eval_time_mins=5,
                                    population_size=10,
                                    periodic_checkpoint_folder="tpot/state3/",
                                    random_state=2,
                                    verbosity=2,
                                    scoring="roc_auc")

pipeline_optimizer.fit(X_train, y_train)

# Best model: XGBClassifier(learning_rate=0.1, max_depth=5, min_child_weight=16, n_estimators=100, nthread=1, subsample=0.6000000000000001)

pred_full = pipeline_optimizer.predict_proba(X_test)
# 'Results: [0.028645013, 0.016220812, 0.005172163, 0.0075716097, 0.02187296....]'

This is my code:

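import copy

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold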

xgb_model = xgb.XGBClassifier(learning_rate=0.1,
                          max_depth=5,
                          min_child_weight=16,
                          n_estimators=100,
                          nthread=1,
                          subsample=0.6)

pred_full = 0

random_state = 1352
cv = 5

skfold = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)

print('On random state:', random_state)
auc_scores = []
current_models = []

for fold, (train_idx, val_idx) in enumerate(skfold.split(X_train, y_train)):
    # Note: this skips the fold with index 3, so only four models are trained.
    if fold == 3:
        continue

    _X_train, _y_train = X_train.iloc[train_idx], y_train.iloc[train_idx]
    _X_val, _y_val = X_train.iloc[val_idx], y_train.iloc[val_idx]

    lgb_clf = copy.deepcopy(xgb_model)
    lgb_clf.fit(_X_train, _y_train,
                eval_set=[(_X_val, _y_val)],
                verbose=50,
                eval_metric='auc')

    y_pred = lgb_clf.predict_proba(_X_val)
    y_pred = y_pred[:, 1]
    new_scores = roc_auc_score(_y_val, y_pred)
    auc_scores.append(new_scores)
    current_models.append(copy.deepcopy(lgb_clf))

    y_test = lgb_clf.predict_proba(X_test)
    y_test = y_test[:, 1]

    pred_full += y_test

assert current_models[0] != current_models[1]

print('\nCurrent auc scores:', auc_scores)
# make_gini is the poster's own helper; its definition is not shown in the post.
print('Current gini scores:', make_gini(auc_scores))
print('Gini:', np.mean(make_gini(auc_scores)), ', std:', np.std(make_gini(auc_scores)))

pred_full /= len(current_models)
result = pd.DataFrame({
    'id': X_test.index,
    'label': pred_full
})

result.to_csv('xgb_basic.csv', index=False)

# Result: [0.01660117380820306, 0.016498301470441075, 0.015465876027333536, 0.0156495640327246, 0.016571651544708212, 0.015756741687708097, 0.01537791817766737...]

I can't make XGBoost model to fit training data as TPOT.

htluandc2 on 26 Feb 2020

Hmm, what do you mean by "I can't make XGBoost model to fit training data as TPOT"? Are you trying to reproduce the average CV score that is used internally for evaluation? If so, you can use cross_val_score from scikit-learn to reproduce the CV score (as we do in one of our unit tests).

Also, it seems that random_state=2 is missing from xgb.XGBClassifier in your code.
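
As a rough check of the suggestion above, the sketch below compares `cross_val_score` with a hand-written fold loop. It assumes synthetic data and a `RandomForestClassifier` stand-in (not the XGBoost model from the thread). With the same splitter and the same estimator `random_state`, the two sets of per-fold scores should agree.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Fixing random_state on both the estimator and the splitter makes runs repeatable.
model = RandomForestClassifier(n_estimators=50, random_state=2)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)

# Per-fold scores computed by scikit-learn.
auto_scores = cross_val_score(model, X, y, cv=skfold, scoring="roc_auc")

# The same computation written out by hand, one unfitted clone per fold.
manual_scores = []
for train_idx, val_idx in skfold.split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    proba = fold_model.predict_proba(X[val_idx])[:, 1]
    manual_scores.append(roc_auc_score(y[val_idx], proba))
manual_scores = np.array(manual_scores)

print("cross_val_score:", auto_scores.mean())
print("manual loop:    ", manual_scores.mean())
```

If the estimator's `random_state` is left unset, the per-fold scores can change from run to run, which is one reason a hand-written reproduction may not match.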

weixuanfu on 26 Feb 2020

🎉1

lgb_clf.fit(_X_train, _y_train,
            eval_set=[(_X_val, _y_val)],
            verbose=50,
            eval_metric='auc')

Also, TPOT follows the basic scikit-learn API, but the usage above is not standard scikit-learn API; it is specific to xgboost's models.
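
One way to stay within the plain `fit(X, y)` interface while still getting early stopping is to configure it through constructor parameters. The sketch below uses scikit-learn's own `GradientBoostingClassifier` as a stand-in (an assumption; it makes no claim about how a given xgboost version behaves), with synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Early stopping is set on the estimator itself, so no extra fit() arguments are needed.
model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    validation_fraction=0.2,  # internal hold-out split used to monitor progress
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    random_state=2,
)

# Because only fit(X, y) is required, the estimator works with standard CV tooling.
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)
scores = cross_val_score(model, X, y, cv=skfold, scoring="roc_auc")

# After fitting, n_estimators_ reports how many rounds were actually kept.
model.fit(X, y)
rounds_used = model.n_estimators_
print("rounds used:", rounds_used, "of", model.n_estimators)
```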

weixuanfu on 26 Feb 2020

Here is a demo for reproducing the average CV score.

weixuanfu on 26 Feb 2020

👍1

I want to reproduce the results of Tpot.

pred_full = pipeline_optimizer.predict_proba(X_test)

Maybe I missed the random_state of XGBoost.
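
A likely source of the mismatch, going by the earlier reply that the best pipeline is fitted on all data, is that averaging the predictions of per-fold models is a different procedure from predicting with one model fitted on the full training set. The sketch below puts the two side by side; the synthetic data and the `RandomForestClassifier` stand-in are assumptions, so the numbers are not comparable to those in the thread.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=2)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)

# Procedure A: average the test predictions of one model per training fold.
fold_average = np.zeros(len(X_test))
for train_idx, _ in skfold.split(X_train, y_train):
    fold_model = clone(model).fit(X_train[train_idx], y_train[train_idx])
    fold_average += fold_model.predict_proba(X_test)[:, 1]
fold_average /= skfold.get_n_splits()

# Procedure B: fit a single model on all the training data and predict once.
full_fit = clone(model).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# The two prediction vectors have the same shape but are generally not identical.
max_gap = np.abs(fold_average - full_fit).max()
print("largest absolute difference:", max_gap)
```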

htluandc2 on 27 Feb 2020
