Tpot: What is the best model of Tpot?

Created on 26 Feb 2020 · 6 Comments · Source: EpistasisLab/tpot

Option 1: the average of the models fitted on the cross-validation folds.
Option 2: a single model fitted on all the data.
Option 3: the best single model among those fitted on one cross-validation fold each.

I want to manually handle some model parameters (such as early stopping), but I can't rebuild the current model.

Thank you very much.

question

All 6 comments

In general, TPOT uses option 1 to evaluate models/pipelines, i.e. each candidate is scored by its average cross-validation score. The best model/pipeline found by that evaluation should then be fitted on all the data for scoring and prediction.
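As a rough illustration of that distinction, here is a minimal sketch using plain scikit-learn rather than TPOT itself. The synthetic data, the LogisticRegression stand-in for a candidate pipeline, and all variable names are assumptions for illustration only; TPOT's internals are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic data standing in for X_train / y_train / X_test (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in for one candidate pipeline (assumption, not a TPOT result).
candidate = LogisticRegression(max_iter=1000)

# Evaluation step: the candidate is scored by its mean CV score (option 1).
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)
cv_scores = cross_val_score(candidate, X_train, y_train, cv=skfold, scoring="roc_auc")
print("mean CV AUC:", cv_scores.mean())

# Final step: the chosen candidate is fitted once on all training data (option 2),
# and that single fitted model is the one used for prediction.
final_model = candidate.fit(X_train, y_train)
test_proba = final_model.predict_proba(X_test)[:, 1]
```

In this sketch the fold models exist only to produce a score; the predictions come from the single model fitted on all training data, not from an average over the fold models.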

I don't think that's correct.

This is my model from TPOT:

from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)
pipeline_optimizer = TPOTClassifier(generations=10,
                                    n_jobs=4,
                                    cv=skfold,
                                    max_eval_time_mins=5,
                                    population_size=10,
                                    periodic_checkpoint_folder="tpot/state3/",
                                    random_state=2,
                                    verbosity=2,
                                    scoring="roc_auc")

pipeline_optimizer.fit(X_train, y_train)

# Best model: XGBClassifier(learning_rate=0.1, max_depth=5, min_child_weight=16, n_estimators=100, nthread=1, subsample=0.6000000000000001)

pred_full = pipeline_optimizer.predict_proba(X_test)
# 'Results: [0.028645013, 0.016220812, 0.005172163, 0.0075716097, 0.02187296....]'

This is my code:

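import copy

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold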

xgb_model = xgb.XGBClassifier(learning_rate=0.1,
                          max_depth=5,
                          min_child_weight=16,
                          n_estimators=100,
                          nthread=1,
                          subsample=0.6)

pred_full = 0

random_state = 1352
cv = 5

skfold = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)

print('On random state:', random_state)
auc_scores = []
current_models = []

for fold, (train_idx, val_idx) in enumerate(skfold.split(X_train, y_train)):
    # Note: fold 3 is skipped, so only four of the five fold models are trained and averaged
    if fold == 3:
        continue

    _X_train, _y_train = X_train.iloc[train_idx], y_train.iloc[train_idx]
    _X_val, _y_val = X_train.iloc[val_idx], y_train.iloc[val_idx]

    lgb_clf = copy.deepcopy(xgb_model)
    lgb_clf.fit(_X_train, _y_train,
                eval_set=[(_X_val, _y_val)],
                verbose=50,
                eval_metric='auc')

    y_pred = lgb_clf.predict_proba(_X_val)
    y_pred = y_pred[:, 1]
    new_scores = roc_auc_score(_y_val, y_pred)
    auc_scores.append(new_scores)
    current_models.append(copy.deepcopy(lgb_clf))

    # Accumulate this fold's positive-class probabilities on the test set
    test_pred = lgb_clf.predict_proba(X_test)[:, 1]

    pred_full += test_pred

assert current_models[0] != current_models[1]

print('\nCurrent auc scores:', auc_scores)
print('Current gini scores:', make_gini(auc_scores))
print('Gini:', np.mean(make_gini(auc_scores)), ', std:', np.std(make_gini(auc_scores)))

pred_full /= len(current_models)
result = pd.DataFrame({
    'id': X_test.index,
    'label': pred_full
})

result.to_csv('xgb_basic.csv', index=False)

# Result: [0.01660117380820306, 0.016498301470441075, 0.015465876027333536, 0.0156495640327246, 0.016571651544708212, 0.015756741687708097, 0.01537791817766737...]

I can't make XGBoost model to fit training data as TPOT.

Hmm, what do you mean by "I can't make XGBoost model to fit training data as TPOT"? Are you trying to reproduce the average CV score that was used internally during evaluation? If so, you could try cross_val_score from scikit-learn to reproduce the CV score (as we do in one of our unit tests).

Also, it seems that random_state=2 is missing for xgb.XGBClassifier in your code.
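To illustrate why the seed matters when subsample is below 1, here is a small sketch. It uses scikit-learn's GradientBoostingClassifier as a stand-in for xgb.XGBClassifier (an assumption, chosen so the example needs only scikit-learn), on synthetic data; the helper name fit_predict is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (assumption); the real X_train / y_train are not available here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def fit_predict(seed):
    # subsample < 1 makes each boosting round draw a random subset of rows,
    # so the fitted model depends on the seed.
    model = GradientBoostingClassifier(
        n_estimators=50, max_depth=3, subsample=0.6, random_state=seed
    )
    return model.fit(X, y).predict_proba(X)[:, 1]


same_a = fit_predict(2)
same_b = fit_predict(2)
other = fit_predict(3)

print("same seed, identical predictions:", np.allclose(same_a, same_b))
print("different seed, identical predictions:", np.allclose(same_a, other))
```

With row subsampling enabled, a model trained without the same seed should not be expected to match another model's predictions exactly, even when every other hyperparameter is identical.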

lgb_clf.fit(_X_train, _y_train,
            eval_set=[(_X_val, _y_val)],
            verbose=50,
            eval_metric='auc')

Also, TPOT follows the basic scikit-learn API, but the usage above does not look like the standard scikit-learn API; it seems to be specific to xgboost's models.

Here is a demo for reproducing the average CV score.
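A minimal sketch of that idea, assuming a RandomForestClassifier stand-in and synthetic data rather than the actual TPOT pipeline: cross_val_score, called with the same cv splitter and scoring, should give the same per-fold scores as a hand-written fold loop.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data and stand-in model (assumptions for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=25, random_state=2)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)

# One call: scikit-learn clones the model, fits it on each training split,
# and scores it on the matching validation split.
cv_scores = cross_val_score(model, X, y, cv=skfold, scoring="roc_auc")

# The same computation written out by hand.
manual_scores = []
for train_idx, val_idx in skfold.split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    val_proba = fold_model.predict_proba(X[val_idx])[:, 1]
    manual_scores.append(roc_auc_score(y[val_idx], val_proba))

print("cross_val_score mean:", cv_scores.mean())
print("manual loop mean:    ", np.mean(manual_scores))
```

The match relies on the splitter and the model both having fixed seeds; dropping either random_state would make the two runs diverge.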

I want to reproduce the results of TPOT:

pred_full = pipeline_optimizer.predict_proba(X_test)

Maybe I missed the random_state for XGBoost.
