Option 1: average of the models fitted on the cross-validation folds.
Option 2: fit on all data.
Option 3: the best of the models fitted on a single cross-validation fold.
I want to manually control some model parameters (such as early stopping). I can't restructure the current model.
Thank you very much.
In general, TPOT uses option 1 to evaluate models/pipelines, i.e. each candidate is scored by its average cross-validation score. The best model/pipeline after evaluation should then be fitted on all data for scoring and prediction.
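To make the distinction concrete, here is a minimal sketch of the two roles: cross-validation is used only to score a model, while predictions come from one model refit on all training data. It also shows that averaging the predictions of per-fold models gives different numbers. The synthetic data and the scikit-learn GradientBoostingClassifier are stand-ins chosen for illustration (assumptions, not what TPOT selected in this thread).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption: the real data from the thread is not available).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Stand-in estimator (assumption: any scikit-learn style classifier would do here).
model = GradientBoostingClassifier(n_estimators=50, random_state=2)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)

# Evaluation: the average score over the CV folds.
cv_scores = cross_val_score(clone(model), X_train, y_train, cv=skfold, scoring="roc_auc")
mean_cv_auc = cv_scores.mean()

# Prediction: a single model refit on all of the training data.
full_model = clone(model).fit(X_train, y_train)
pred_full_fit = full_model.predict_proba(X_test)[:, 1]

# For contrast: average the test predictions of one model per fold.
fold_preds = []
for train_idx, _ in skfold.split(X_train, y_train):
    fold_model = clone(model).fit(X_train[train_idx], y_train[train_idx])
    fold_preds.append(fold_model.predict_proba(X_test)[:, 1])
pred_fold_avg = np.mean(fold_preds, axis=0)

print("mean CV AUC:", mean_cv_auc)
print("max abs difference between the two prediction vectors:",
      np.abs(pred_full_fit - pred_fold_avg).max())
```

The two prediction vectors are not expected to match, because the fold models each see only about 80% of the training rows.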
I don't think that's correct.
This is my model from TPOT:
from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)
pipeline_optimizer = TPOTClassifier(generations=10,
                                    n_jobs=4,
                                    cv=skfold,
                                    max_eval_time_mins=5,
                                    population_size=10,
                                    periodic_checkpoint_folder="tpot/state3/",
                                    random_state=2,
                                    verbosity=2,
                                    scoring="roc_auc")
pipeline_optimizer.fit(X_train, y_train)
# Best model: XGBClassifier(learning_rate=0.1, max_depth=5, min_child_weight=16, n_estimators=100, nthread=1, subsample=0.6000000000000001)
pred_full = pipeline_optimizer.predict_proba(X_test)
# 'Results: [0.028645013, 0.016220812, 0.005172163, 0.0075716097, 0.02187296....]'
This is my code:
import copy

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Same hyperparameters as the best model reported by TPOT (no random_state set here).
xgb_model = xgb.XGBClassifier(learning_rate=0.1,
                              max_depth=5,
                              min_child_weight=16,
                              n_estimators=100,
                              nthread=1,
                              subsample=0.6)
pred_full = 0
random_state = 1352
cv = 5
skfold = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
print('On random state:', random_state)
auc_scores = []
current_models = []
for fold, (train_idx, val_idx) in enumerate(skfold.split(X_train, y_train)):
    # Fold 3 is skipped, so only 4 of the 5 fold models are trained and averaged.
    if fold == 3:
        continue
    _X_train, _y_train = X_train.iloc[train_idx], y_train.iloc[train_idx]
    _X_val, _y_val = X_train.iloc[val_idx], y_train.iloc[val_idx]
    lgb_clf = copy.deepcopy(xgb_model)
    lgb_clf.fit(_X_train, _y_train,
                eval_set=[(_X_val, _y_val)],
                verbose=50,
                eval_metric='auc')
    # AUC on the held-out fold, using the positive-class probability.
    y_pred = lgb_clf.predict_proba(_X_val)
    y_pred = y_pred[:, 1]
    new_scores = roc_auc_score(_y_val, y_pred)
    auc_scores.append(new_scores)
    current_models.append(copy.deepcopy(lgb_clf))
    # Accumulate this fold model's test predictions for averaging.
    y_test = lgb_clf.predict_proba(X_test)
    y_test = y_test[:, 1]
    pred_full += y_test
# Sanity check: the stored fold models are distinct objects.
assert current_models[0] != current_models[1]
# make_gini is my own helper (definition not shown here).
print('\nCurrent auc scores:', auc_scores)
print('Current gini scores:', make_gini(auc_scores))
print('Gini:', np.mean(make_gini(auc_scores)), ', std:', np.std(make_gini(auc_scores)))
# Average the test predictions over the fold models.
pred_full /= len(current_models)
result = pd.DataFrame({
    'id': X_test.index,
    'label': pred_full
})
result.to_csv('xgb_basic.csv', index=False)
# Result: [0.01660117380820306, 0.016498301470441075, 0.015465876027333536, 0.0156495640327246, 0.016571651544708212, 0.015756741687708097, 0.01537791817766737...]
I can't make XGBoost model to fit training data as TPOT.
Hmm, what do you mean by "I can't make XGBoost model to fit training data as TPOT"? Are you trying to reproduce the average CV score that was used internally during evaluation? If so, you can try cross_val_score from scikit-learn to reproduce the CV score (as we do in one of our unit tests).
Also, it seems that random_state=2 is missing for xgb.XGBClassifier in your code.
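As a rough illustration of why the seed matters when subsample is below 1, the sketch below fits the same model twice with one seed and once with another. It uses scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in for XGBClassifier (an assumption made so the example needs only scikit-learn); fit_predict is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def fit_predict(seed):
    """Fit a row-subsampled boosting model with the given seed and return probabilities."""
    # subsample < 1 means each boosting round trains on a random subset of rows,
    # so the fitted trees depend on the random seed.
    clf = GradientBoostingClassifier(n_estimators=30, subsample=0.6, random_state=seed)
    return clf.fit(X, y).predict_proba(X)[:, 1]


same_a = fit_predict(2)
same_b = fit_predict(2)
other = fit_predict(3)

print("same seed, identical predictions:", np.array_equal(same_a, same_b))
print("different seed, identical predictions:", np.array_equal(same_a, other))
```

With matching seeds the runs should agree; with different seeds they generally will not, even though every other hyperparameter is the same.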
lgb_clf.fit(_X_train, _y_train, eval_set=[(_X_val, _y_val)], verbose=50, eval_metric='auc')
Also, TPOT follows the basic scikit-learn API, but the usage above (eval_set, eval_metric) is not part of the standard scikit-learn fit API; it is specific to xgboost's models.
Here is a demo for reproducing average cv score.
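A minimal sketch of that idea, assuming synthetic data and a scikit-learn GradientBoostingClassifier as a stand-in for the exported pipeline: cross_val_score with the same splitter and scoring string should give the same per-fold scores as a hand-written loop over the folds, as long as the estimator's random_state is fixed.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data and estimator (assumptions for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = GradientBoostingClassifier(n_estimators=30, subsample=0.6, random_state=2)

# An integer random_state makes the splitter return the same folds on every call.
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1352)

# Per-fold scores from the scikit-learn helper.
api_scores = cross_val_score(model, X, y, cv=skfold, scoring="roc_auc")

# The same evaluation written out by hand.
manual_scores = []
for train_idx, val_idx in skfold.split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    val_proba = fold_model.predict_proba(X[val_idx])[:, 1]
    manual_scores.append(roc_auc_score(y[val_idx], val_proba))
manual_scores = np.array(manual_scores)

print("cross_val_score:", api_scores, "mean:", api_scores.mean())
print("manual loop:    ", manual_scores, "mean:", manual_scores.mean())
```

Note that this reproduces the evaluation score only; the test-set predictions still come from a model fitted on all of the training data.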
I want to reproduce the results of TPOT:
pred_full = pipeline_optimizer.predict_proba(X_test)
Maybe I missed the random_state for XGBoost.