Tpot: Brainstorm: How can we keep the pipelines trained so they don't need to re-train on each call?

Created on 12 Nov 2015 · 10 Comments · Source: EpistasisLab/tpot

Currently, TPOT requires the training data to be passed along with any additional data so the pipeline can train the sklearn models on the training data again. This is required because the pipeline consists of functions: each random forest, decision tree, etc. is a function and the model is garbage collected as soon as the function terminates.

Let's brainstorm: How can we design these functions so the models remain persistent and don't need to be re-trained? This is really only important for the final pipeline, where the user would be performing score() and predict() calls against the pipeline.
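To make the problem concrete, here is a minimal sketch of the two designs, using only the standard library. `majority_step` and `MajorityStep` are hypothetical toy names, not TPOT code: the first rebuilds its "model" on every call, the second keeps the fitted state on an instance so it can be queried again without the training data.

```python
from collections import Counter


def majority_step(train_labels, n_new):
    """Function-style step: fits a throwaway model on every call."""
    majority = Counter(train_labels).most_common(1)[0][0]  # the "model"
    return [majority] * n_new  # the fitted state is lost once we return


class MajorityStep:
    """Object-style step: fit once, keep the fitted state on the instance."""

    def fit(self, train_labels):
        self.majority_ = Counter(train_labels).most_common(1)[0][0]
        return self

    def predict(self, n_new):
        return [self.majority_] * n_new


labels = ["a", "b", "a", "a"]

# Function style: the training labels must be passed in again each time.
print(majority_step(labels, 2))

# Object style: train once, then predict as often as needed.
step = MajorityStep().fit(labels)
print(step.predict(2), step.predict(1))
```

The object style is essentially what scikit-learn estimators already do with `fit()` and `predict()`, which is why keeping the estimator objects around (rather than wrapping them in functions) would avoid the re-training.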

need contributor question

All 10 comments

http://blaze.pydata.org/blog/2015/10/19/dask-learn/ Check this out

Dask can also be used to parallelize.

We should look at using sklearn Pipeline objects to represent our pipelines. I believe that could go a long way toward solving this issue.

This might be helpful: scikit-learn's 3.4. Model persistence. Basically, it is an example of how to save a model as a pickle. Below is the code example from the linked page:

>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)  
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

Maybe create a new instance attribute, self.saved_pipe = pickle.dumps(clf), that stores the pickled pipeline of the best model in a generation, and update it every generation.
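A rough sketch of that idea, assuming a generic search loop rather than TPOT's actual internals. `ToyOptimizer`, its scoring rule, and the dict used as a "pipeline" are all hypothetical placeholders; only the `saved_pipe` attribute name comes from the suggestion above.

```python
import pickle
import random


class ToyOptimizer:
    """Hypothetical stand-in for an evolutionary search loop (not TPOT's code)."""

    def __init__(self, generations=5, seed=0):
        self.generations = generations
        self.rng = random.Random(seed)
        self.best_score = float("-inf")
        self.saved_pipe = None  # pickled bytes of the best pipeline so far

    def fit(self):
        for _ in range(self.generations):
            # Pretend each generation produces one candidate "fitted pipeline".
            candidate = {"threshold": self.rng.random()}
            # Made-up score: candidates closer to 0.5 are better.
            score = 1.0 - abs(candidate["threshold"] - 0.5)
            if score > self.best_score:
                self.best_score = score
                # Update the saved copy whenever a generation improves on it.
                self.saved_pipe = pickle.dumps(candidate)
        return self

    def best_pipeline(self):
        """Restore the best pipeline without re-running the search."""
        return pickle.loads(self.saved_pipe)


opt = ToyOptimizer().fit()
best = opt.best_pipeline()
print(best, opt.best_score)
```

With real scikit-learn estimators, the object passed to `pickle.dumps` would be the fitted model itself instead of a dict.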

@KobaKhit I think pickle is generally a good idea in terms of efficiency; however, and maybe I am too paranoid :P, I always worry about compatibility issues (e.g., the different pickle protocols and Python version incompatibilities). In any case, if pickle is only used internally during model training on the particular system where TPOT is "trained", this wouldn't be a problem, I guess.
However, I'd suggest joblib instead of pickle, since it's better at storing NumPy arrays; there was a discussion on the mailing list today where someone pickled a random forest (50 trees, 23 features, 20 MB dataset) -> 50 MB with standard pickle, ~15 MB with joblib.

Personally, I switched over to using JSON files for model persistence (e.g., when using scikit-learn or other tools). This way, I always have a human-readable record of everything (in case pickle files get corrupted or are incompatible in a different environment); sure, this would probably be slower computationally, but I think it would be the more "robust" or "reproducible" option. Someone else asked me about that recently, so I put up a quick ipynb with an example specific to sklearn, if you are interested: http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/bonus/scikit-model-to-json.ipynb
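For a rough picture of the JSON approach without any dependencies, here is a sketch with a toy model. `ThresholdModel`, `model_to_json`, and `model_from_json` are hypothetical names made up for this example; the linked notebook shows the scikit-learn-specific version. The idea is to write out only the learned parameters and rebuild the object from them later.

```python
import json


class ThresholdModel:
    """Toy one-feature classifier, hypothetical and for illustration only."""

    def __init__(self, threshold=None):
        self.threshold = threshold

    def fit(self, xs, ys):
        # Put the threshold halfway between the two class means.
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]


def model_to_json(model):
    """Serialize only the learned parameters as human-readable JSON."""
    record = {"class": "ThresholdModel", "threshold": model.threshold}
    return json.dumps(record, indent=2)


def model_from_json(text):
    """Rebuild a usable model from the JSON record, with no re-training."""
    record = json.loads(text)
    return ThresholdModel(threshold=record["threshold"])


model = ThresholdModel().fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
text = model_to_json(model)
restored = model_from_json(text)
print(text)
print(restored.predict([0.5, 7.0]))
```

The trade-off is that every model type needs its own to/from code, whereas pickle handles arbitrary objects automatically.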

We can extend this idea to a "pipeline" knowledge base, or knowledge graph. Then, given a pile of data, the system can figure out a suitable "pipeline".

This issue is now handled because we're working directly with sklearn Pipeline objects that are persistent.

I get this error when trying to pickle the tpot model:

import pickle

with open('tpot_.pkl', 'wb') as xx:
    pickle.dump(tpot, xx)

"""PicklingError: Can't pickle : attribute lookup XGBClassifier__learning_rate on tpot.operator_utils failed"""

@woodrujm I think this is related to #520, about pickling the TPOT object. For now, the entire TPOT object is not picklable because of the attribute lookup issue. You should still be able to pickle the attributes in the TPOT API. We may work on this pickling issue later.

Okay, thanks for the reply. TPOT has been great so far!

@woodrujm, if you instead use pickle.dump(tpot.fitted_pipeline_, xx), it works as you might have intended.
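For anyone curious why the whole object fails while one attribute works, here is a small standard-library sketch that reproduces the same kind of failure. `ToySearch` and `make_dynamic_class` are hypothetical stand-ins, not TPOT code, and I am assuming the real error has a similar cause: pickle stores classes by name, so a class created at runtime that is not reachable as a module-level attribute cannot be looked up again.

```python
import pickle


def make_dynamic_class(name):
    # A class created at runtime and never bound to a module-level name,
    # so pickle cannot find it again by attribute lookup.
    return type(name, (object,), {})


class ToySearch:
    """Hypothetical stand-in for a search object; not TPOT's real class."""

    def __init__(self):
        # Internal helper whose class pickle cannot locate by name.
        self._operator = make_dynamic_class("Toy__learning_rate")()
        # Plain, picklable result of the search.
        self.fitted_pipeline_ = {"steps": ["scale", "classify"], "coef": [0.1, 0.9]}


search = ToySearch()

# Pickling the whole object fails because of the dynamic class inside it.
try:
    pickle.dumps(search)
    whole_object_pickled = True
except (pickle.PicklingError, AttributeError) as err:
    whole_object_pickled = False
    print("whole object:", err)

# Pickling only the fitted result works.
restored = pickle.loads(pickle.dumps(search.fitted_pipeline_))
print("fitted part:", restored)
```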
