Greetings,
First of all, thank you for the amazing job you did on this project. I'm trying to use TPOT in a research context, and after a few tests, I have some questions about how to use it:
I've seen in issue #337 that we can retrieve the explored pipelines with the tpot._evaluated_individuals set. Is this a good way to use it, or could it change between versions? I want to be able to retrieve the best model, the features, and the parameters so I can store them in a DB.
Is there any way to retrieve the best features, as shown in this paper on page 12, and to know which original features the constructed ones are based on?
Thank you for your help,
Kind regards,
Axel.
For the first question, I think this is the right way to use it in the current version and the next version (0.8). If this changes in future versions, you can find an up-to-date example in the unit test test_evaluated_individuals.
For the second one, TPOT currently cannot provide a ranking of feature importances like Figure 5 in the paper. The feature importances on page 12 were estimated using the Random Forest method.
Hi @axelroy,
If you want to access the best model from the TPOT run, you can access it with the tpot._fitted_pipeline property at the end of the run. If you run TPOT at the highest verbosity (3), you can also access the entire Pareto front of best pipelines with the tpot._pareto_front_fitted_pipelines property. Note that both of these properties are only assigned at the end of a TPOT run.
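For the goal of storing the best model and its parameters in a DB, here is a minimal sketch of how the fitted pipeline could be flattened into a JSON record. It assumes the fitted pipeline behaves like a scikit-learn Pipeline, i.e. it exposes a steps list of (name, estimator) tuples and each estimator has get_params(). The helper names describe_pipeline and pipeline_to_json and the stub classes are hypothetical and exist only so the example runs without TPOT installed; with a real run you would pass tpot._fitted_pipeline instead of the stub.

```python
import json
from types import SimpleNamespace


def describe_pipeline(fitted_pipeline):
    """Return one dict per pipeline step: step name, estimator class, parameters."""
    return [
        {
            "step": name,
            "estimator": type(estimator).__name__,
            # deep=False keeps only the estimator's own parameters
            "params": estimator.get_params(deep=False),
        }
        for name, estimator in fitted_pipeline.steps
    ]


def pipeline_to_json(fitted_pipeline):
    """Serialize the description; default=str covers values JSON cannot encode."""
    return json.dumps(describe_pipeline(fitted_pipeline), default=str, sort_keys=True)


# Hypothetical stand-ins for fitted estimators (not part of TPOT or scikit-learn).
class _StubScaler:
    def get_params(self, deep=True):
        return {"with_mean": True}


class _StubForest:
    def get_params(self, deep=True):
        return {"n_estimators": 100, "criterion": "gini"}


demo_pipeline = SimpleNamespace(
    steps=[("scaler", _StubScaler()), ("forest", _StubForest())]
)
record = pipeline_to_json(demo_pipeline)
print(record)
```

The resulting string can go into a text or JSON column. Since the underscore-prefixed properties are private, it is probably worth re-checking them when upgrading TPOT.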
In terms of presenting feature importances, those are limited to specific models. In the case of the paper you linked, those were decision trees and random forests, so I was displaying tree-based feature importances. If TPOT discovers a pipeline for you that uses a decision tree or other tree-based method as the final classifier, for example, then you could access those feature importances with the following code:
# The first index to -1 gets the last step in the pipeline
# The second index to 1 gets the actual classifier object
tpot._fitted_pipeline.steps[-1][1].feature_importances_
which is an array of feature importances that you can then match with the feature names. The same applies with linear models, except you'd access the coef_ property instead.
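To make the "match with the feature names" step concrete, here is a small sketch. It assumes the final step exposes feature_importances_ (tree-based models) and that the number of names equals the number of columns the final estimator actually saw. The helper name pair_importances and the stub pipeline are hypothetical and are there only so the snippet runs on its own; in practice you would pass tpot._fitted_pipeline and your own column names.

```python
from types import SimpleNamespace


def pair_importances(fitted_pipeline, feature_names):
    """Pair feature names with the final estimator's importances, highest first."""
    # steps is a list of (name, estimator) tuples; [-1][1] is the final estimator
    final_estimator = fitted_pipeline.steps[-1][1]
    importances = getattr(final_estimator, "feature_importances_", None)
    if importances is None:
        raise AttributeError(
            f"{type(final_estimator).__name__} has no feature_importances_"
        )
    importances = [float(value) for value in importances]
    names = list(feature_names)
    # A mismatch means earlier pipeline steps added, removed, or transformed columns,
    # so a plain zip would silently mislabel the scores.
    if len(names) != len(importances):
        raise ValueError(
            f"{len(names)} feature names but {len(importances)} importances"
        )
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for a fitted pipeline ending in a tree-based classifier.
demo_pipeline = SimpleNamespace(
    steps=[
        ("scaler", object()),
        ("forest", SimpleNamespace(feature_importances_=[0.2, 0.5, 0.3])),
    ]
)
ranked = pair_importances(demo_pipeline, ["age", "height", "weight"])
for name, score in ranked:
    print(name, "\t", score)
```

The length check is deliberate: if preprocessing or feature-construction steps sit before the final estimator, the importances no longer line up one-to-one with the input columns, and raising an error is safer than zipping the two lists anyway.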
Thank you very much for the responses, I'll test this as soon as possible.
Closing this issue for now. Please feel free to re-open if you have any more questions or comments.
I'm using tpot and loving it, but am struggling to join the names of the features I provide tpot with the list of feature importances that I extract using tpot._fitted_pipeline.steps[-1][1].feature_importances_. I understand that this is because tpot is building and evaluating new synthetic features. Do you have a recommended method for either or both of the following: (1) disabling synthetic feature generation so I can zip my feature names to the feature importances; or (2) appending names of the generated features to my list of feature names so I can zip them with feature importances? Ideally, I'd like to be able to do something like this:
for feature_name, feature_score in zip(df.drop('class', axis=1).columns, tpot._fitted_pipeline.steps[-1][1].feature_importances_):
    print(feature_name, '\t', feature_score)
Here's an example pipeline to which I would like to apply such a method:
{'config_dict': {'sklearn.ensemble.RandomForestClassifier': {'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False]}, 'sklearn.tree.DecisionTreeClassifier': {'criterion': ['gini', 'entropy'], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21)}, 'sklearn.ensemble.ExtraTreesClassifier': {'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False]}, 'sklearn.preprocessing.Binarizer': {'threshold': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])}, 'sklearn.cluster.FeatureAgglomeration': {'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']}, 'sklearn.preprocessing.MaxAbsScaler': {}, 'sklearn.preprocessing.MinMaxScaler': {}, 'sklearn.preprocessing.Normalizer': {'norm': ['l1', 'l2', 'max']}, 'sklearn.decomposition.PCA': {'svd_solver': ['randomized'], 'iterated_power': range(1, 11)}, 'sklearn.kernel_approximation.RBFSampler': {'gamma': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])}, 'sklearn.preprocessing.RobustScaler': {}, 'sklearn.preprocessing.StandardScaler': {}, 'tpot.builtins.ZeroCount': {}, 'sklearn.feature_selection.SelectFwe': {'alpha': array([0. , 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008,
0.009, 0.01 , 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017,
0.018, 0.019, 0.02 , 0.021, 0.022, 0.023, 0.024, 0.025, 0.026,
0.027, 0.028, 0.029, 0.03 , 0.031, 0.032, 0.033, 0.034, 0.035,
0.036, 0.037, 0.038, 0.039, 0.04 , 0.041, 0.042, 0.043, 0.044,
0.045, 0.046, 0.047, 0.048, 0.049]), 'score_func': {'sklearn.feature_selection.f_classif': None}}, 'sklearn.feature_selection.SelectPercentile': {'percentile': range(1, 100), 'score_func': {'sklearn.feature_selection.f_classif': None}}, 'sklearn.feature_selection.VarianceThreshold': {'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}}, 'crossover_rate': 0.1, 'cv': 5, 'disable_update_check': False, 'early_stop': None, 'generations': 10, 'max_eval_time_mins': 5, 'max_time_mins': None, 'memory': None, 'mutation_rate': 0.9, 'n_jobs': 7, 'offspring_size': 10, 'periodic_checkpoint_folder': None, 'population_size': 10, 'random_state': None, 'scoring': None, 'subsample': 1.0, 'verbosity': 2, 'warm_start': False}
Apologies if I missed this being addressed previously.
Hmm, I think the synthetic features should be in the first (leftmost) columns, but they usually have very high importance scores in the last operator of the pipeline.
For now, TPOT does not provide an option for disabling synthetic feature generation. But:
One of my dev branches of TPOT, called noCDF_noStacking, has an option named simple_pipeline, which disables both StackingEstimator and CombineDFs if simple_pipeline=True (e.g. TPOTClassifier(simple_pipeline=True)). Note that this dev branch is not fully tested yet. If you want to try TPOT without StackingEstimator and FeatureUnion, you may install this branch in your test environment via the command below:
pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu/tpot.git@noCDF_noStacking
Please check #152 for more details. We are working on a more advanced pipeline configuration option.
Thanks @weixuanfu! For purposes of transparency, explainability, and trust, it would be lovely to have the ability to connect TPOT to something like eli5 for feature importance inspection and exploration. This may not be so important for biological work (I don't really know), but for public safety work, it's quite important to be able to explain -- if only very roughly -- how a model works.