TPOT: Question: Can multiple models be saved/exported from one TPOT optimization?

Created on 7 Jun 2018 · 3 Comments · Source: EpistasisLab/tpot

Hi, I'm new to TPOT, but it looks very promising. My question is whether it is possible to save the parameters and/or export code for multiple models from one TPOT optimization, rather than just the most predictive model.

For example, I would like to see the top 5 (or whatever) models to assess if there are any common features, or maybe to choose a model that is easier to explain compared to one with marginal improvements in accuracy.

Thanks!

question

pan-alex

Most helpful comment

Thanks @weixuanfu and @rhiever! This answered my question.

For anyone with similar needs, I've included some code based on the minimal working example from the TPOT documentation, which uses scikit-learn's handwritten digits data set (load_digits).

import pandas as pd
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

Build a data frame of all evaluated models and sort it by CV score, best first:

# evaluated_individuals_ maps each pipeline string to a dict of metadata
rows = []
for model_name, model_info in tpot.evaluated_individuals_.items():
    # Pull out cv_score as its own column so it is sortable
    rows.append({'cv_score': model_info.get('internal_cv_score'),
                 'model': model_name,
                 'model_info': model_info})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
model_scores = pd.DataFrame(rows)
model_scores = model_scores.sort_values('cv_score', ascending=False)

For example, the top 5 rows of the output:

index | cv_score | model | model_info
-- | -- | -- | --
276 | 0.986708 | KNeighborsClassifier(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=9, DecisionTreeClassifier__min_samples_leaf=3, DecisionTreeClassifier__min_samples_split=14), KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform) | {'generation': 'INVALID', 'mutation_count': 3, 'crossover_count': 0, 'predecessor': ('KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)',), 'operator_count': 2, 'internal_cv_score': 0.9867084418027815}
251 | 0.986686 | KNeighborsClassifier(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=3, DecisionTreeClassifier__min_samples_leaf=18, DecisionTreeClassifier__min_samples_split=16), KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform) | {'generation': 'INVALID', 'mutation_count': 1, 'crossover_count': 2, 'predecessor': ('KNeighborsClassifier(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=3, DecisionTreeClassifier__min_samples_leaf=18, DecisionTreeClassifier__min_samples_split=16), KNeighborsClassifier__n_neighbors=19, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)', 'KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance)'), 'operator_count': 2, 'internal_cv_score': 0.9866863255542502}
244 | 0.986686 | KNeighborsClassifier(GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=1.0, GradientBoostingClassifier__max_depth=1, GradientBoostingClassifier__max_features=0.15000000000000002, GradientBoostingClassifier__min_samples_leaf=20, GradientBoostingClassifier__min_samples_split=11, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.7500000000000001), KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform) | {'generation': 'INVALID', 'mutation_count': 3, 'crossover_count': 0, 'predecessor': ('KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)',), 'operator_count': 2, 'internal_cv_score': 0.9866863255542502}
202 | 0.986686 | KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform) | {'generation': 'INVALID', 'mutation_count': 2, 'crossover_count': 0, 'predecessor': ('KNeighborsClassifier(RFE(input_matrix, RFE__ExtraTreesClassifier__criterion=gini, RFE__ExtraTreesClassifier__max_features=0.1, RFE__ExtraTreesClassifier__n_estimators=100, RFE__step=0.6500000000000001), KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)',), 'operator_count': 1, 'internal_cv_score': 0.9866863255542502}
89 | 0.986678 | KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance) | {'generation': 'INVALID', 'mutation_count': 1, 'crossover_count': 0, 'predecessor': ('KNeighborsClassifier(RFE(input_matrix, RFE__ExtraTreesClassifier__criterion=gini, RFE__ExtraTreesClassifier__max_features=0.1, RFE__ExtraTreesClassifier__n_estimators=100, RFE__step=0.6500000000000001), KNeighborsClassifier__n_neighbors=5, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance)',), 'operator_count': 1, 'internal_cv_score': 0.9866781855461101}
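
To get a shortlist that also favours simpler pipelines, the same dictionary can be ranked without pandas. This is only a minimal sketch, not TPOT functionality: `top_pipelines` is a hypothetical helper, the `sample` dictionary is made up to mimic the keys shown in the output above, and it assumes that pipelines which failed to evaluate carry a non-finite `internal_cv_score`.

```python
import math


def top_pipelines(evaluated, n=5):
    """Return the n best (name, cv_score, operator_count) tuples.

    Hypothetical helper: ranks by CV score (highest first) and breaks
    ties in favour of pipelines with fewer operators.
    """
    rows = [
        (name, info["internal_cv_score"], info["operator_count"])
        for name, info in evaluated.items()
        # Assumption: pipelines that failed to evaluate have a non-finite score
        if math.isfinite(info["internal_cv_score"])
    ]
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows[:n]


# Made-up stand-in for tpot.evaluated_individuals_ (names and scores are illustrative)
sample = {
    "KNeighborsClassifier(DecisionTreeClassifier(input_matrix))":
        {"operator_count": 2, "internal_cv_score": 0.9867},
    "KNeighborsClassifier(input_matrix)":
        {"operator_count": 1, "internal_cv_score": 0.9867},
    "GaussianNB(input_matrix)":
        {"operator_count": 1, "internal_cv_score": 0.81},
    "BrokenPipeline(input_matrix)":
        {"operator_count": 1, "internal_cv_score": float("-inf")},
}

for name, score, n_ops in top_pipelines(sample, n=2):
    print(f"{score:.4f}  {n_ops}  {name}")
```

With a real run, `tpot.evaluated_individuals_` would be passed in place of `sample`. Breaking ties on `operator_count` is just one way to prefer a model that is easier to explain when the accuracy difference is marginal.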

pan-alex on 8 Jun 2018

👍6

All 3 comments

For now, TPOT saves every evaluated pipeline in the evaluated_individuals_ attribute (see also the similar issue #516). From the TPOT API docs:

evaluated_individuals_: Python dictionary
Dictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline).

This attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated.
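
Since the keys are plain string representations of the pipelines, one way to look for operators that the better pipelines have in common is to parse those strings. The sketch below is an assumption-laden illustration: `operator_counts` is a hypothetical helper, the example strings are shortened versions of typical keys, and it relies on operator names being the identifiers that are immediately followed by an opening parenthesis.

```python
import re
from collections import Counter


def operator_counts(pipeline_strings):
    """Count how many pipelines use each operator (hypothetical helper)."""
    counts = Counter()
    for text in pipeline_strings:
        # Operator names are identifiers directly followed by "(";
        # hyperparameters look like Name__param=value and never match.
        # set() ensures an operator is counted once per pipeline.
        counts.update(set(re.findall(r"(\w+)\(", text)))
    return counts


# Shortened, illustrative pipeline strings (not real TPOT output)
top = [
    "KNeighborsClassifier(DecisionTreeClassifier(input_matrix, "
    "DecisionTreeClassifier__max_depth=9), KNeighborsClassifier__n_neighbors=5)",
    "KNeighborsClassifier(GradientBoostingClassifier(input_matrix, "
    "GradientBoostingClassifier__max_depth=1), KNeighborsClassifier__n_neighbors=5)",
    "KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=5)",
]

counts = operator_counts(top)
for operator, n_pipelines in counts.most_common():
    print(operator, n_pipelines)
```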

weixuanfu on 7 Jun 2018

You can also access the pareto_front_fitted_pipelines_ attribute of TPOT, which holds all pipelines along the Pareto front. This Pareto front shows the trade-off between pipeline predictive performance (e.g., accuracy) and pipeline complexity (i.e., the number of steps in the pipeline).

See the TPOT API for more info on the pareto_front_fitted_pipelines_ attribute.
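
If that attribute is not available in a given run, a similar trade-off can be approximated from evaluated_individuals_. The following is a rough sketch rather than TPOT's own implementation: `pareto_front` is a hypothetical helper, the `sample` data is made up, and it assumes each entry has `operator_count` and `internal_cv_score` keys. A pipeline is kept when no other pipeline is at least as simple and at least as accurate while being strictly better on one of the two.

```python
def pareto_front(evaluated):
    """Return names of pipelines not dominated on (operator_count, cv_score).

    Hypothetical helper: fewer operators is better, higher score is better.
    """
    items = [
        (name, info["operator_count"], info["internal_cv_score"])
        for name, info in evaluated.items()
    ]
    front = []
    for name, n_ops, score in items:
        dominated = any(
            other_ops <= n_ops and other_score >= score
            and (other_ops < n_ops or other_score > score)
            for _, other_ops, other_score in items
        )
        if not dominated:
            front.append(name)
    return front


# Made-up stand-in for tpot.evaluated_individuals_ (scores are illustrative)
sample = {
    "KNeighborsClassifier(input_matrix)":
        {"operator_count": 1, "internal_cv_score": 0.95},
    "KNeighborsClassifier(PCA(input_matrix))":
        {"operator_count": 2, "internal_cv_score": 0.98},
    "GaussianNB(input_matrix)":
        {"operator_count": 1, "internal_cv_score": 0.90},
    "GaussianNB(PCA(input_matrix))":
        {"operator_count": 2, "internal_cv_score": 0.97},
}

for name in pareto_front(sample):
    print(name)
```

The pairwise comparison is quadratic in the number of pipelines, which should be acceptable for the few hundred entries a short run produces.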

rhiever on 7 Jun 2018

