Please use this issue to discuss the possibility of porting TPOT over to TensorFlow.
TensorFlow seems promising for speeding up the pipeline optimization procedure of TPOT. I've heard reports of 800-fold speedup on a single core. If this is true, we could probably drop multiprocessing support (which has been problematic) and move over to TensorFlow.
Here's an example of TensorFlow for sklearn: https://github.com/tensorflow/tensorflow/blob/fa4ba830f437fdb9dc1085b4d68a3bab41a16e20/tensorflow/examples/learn/iris_with_pipeline.py
I think the 800-fold speedup may rely on a GPU.
The example does not seem to use TensorFlow to speed up the whole pipeline, only the learn.DNNClassifier step.
I think distributed computing in TensorFlow is very promising. Here are some links about it:
http://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
http://learningtensorflow.com/lesson11/
I think this is a feasible change to make. All that's really needed is to know which files need TensorFlow analogs inserted. Right now it looks like config_regression.py and config_classifier.py are the files that need to have the relevant TF models added. I haven't used TPOT before, but I should be able to make the changes relatively easily if the first post is edited to include a checklist and milestones, so that I can make sure I understand the scope correctly.
It would also be a good idea to decide what version of TensorFlow to use as a base since (as I think someone mentioned on twitter) the API has undergone a lot of changes since its initial release. Things have stabilized a bit now that it's past 1.0.0, but it's still possible some breaking changes might be introduced in the future. Right now the stable version is 1.1.0, but there's already a release candidate for 1.2. Which should we be aiming to guarantee support for?
Do you mean there is a way to wrap these scikit-learn operators in both config files? If so, please check operator_utils.py.
I am thinking about replacing joblib in base.py, since it has issues when dealing with large datasets (#436).
Maybe we should support the current stable version 1.1.
I'm 👍 on using the latest version of TF. What's the ETA for 1.2?
I'm thinking TF integration would come out in TPOT 0.9, as we're ramping up to roll out all of the latest features in a 0.8 release soon.
@weixuanfu2016 Thanks for the suggestion. I'll try to familiarize myself with the project structure before making any potential contributions. I'll do my best to read and absorb the documentation on the Github Page, but if the two of you have any other recommendations for the best route to become familiar with TPOT's code base, I welcome the suggestions.
TensorFlow 1.2rc0 was just "announced" on Twitter this morning by some of the devs. Based on past releases, that would suggest 1.2.0 will be released by June, perhaps as early as next week. There are no breaking changes between 1.1 and 1.2 that look relevant to TPOT. More interesting, though, is that 1.2 is the first version that will include tf.contrib.data, a tool meant specifically for pipelining datasets by introducing a Dataset and Iterator abstraction to TensorFlow. tf.contrib libraries aren't officially supported, but the most used ones are frequently integrated into the main library, and this looks to be one that'll follow that path. I think sticking with 1.1 for now is good for the initial changes, but the TPOT team should keep an eye on TensorFlow's Dataset API for potential uses.
You might have a look at Karoo GP for ideas on integrating TensorFlow: https://github.com/kstaats/karoo_gp/blob/master/karoo_gp_base_class.py
I had a similar thought, which I shared with @rhiever today. I am not familiar with TPOT at all, but my initial/naive idea would be to drop TPOT's multiprocessing and use TensorFlow's Estimator API, which provides a simple interface similar to sklearn's. TPOT could then specify the RunConfig for each Estimator, which holds all the settings for distributed training.
There have been a lot of changes to TF Estimator (many of them are now in core, so they are more stable), so I am not sure whether it still works with sklearn's Pipeline. I'll take a look at the TPOT codebase soon and hopefully propose a better solution (my last comment assumes no knowledge of TPOT, but hopefully it helps move this discussion along from the TF perspective).
👍 Excited to collaborate on this issue, @terrytangyuan!
Note that the part we're thinking of using TF for is the pipeline evaluations, which happen here. Essentially, every iteration of TPOT runs the sklearn cross_val_score function on valid sklearn Pipelines, which can contain one or many sklearn estimators and transformers.
We would definitely drop multiprocessing support (which has had many bugs!) if we could run sklearn on TF.
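For readers who have not looked at that evaluation step, here is a minimal sketch of what scoring one candidate pipeline looks like in plain scikit-learn. The dataset, the two pipeline steps, and cv=5 are illustrative assumptions, not TPOT's actual code or defaults.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; TPOT would use whatever the user passes to fit().
X, y = load_iris(return_X_y=True)

# One candidate pipeline: a transformer followed by an estimator.
# (These two steps are an assumption chosen for the example.)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Score the whole pipeline with 5-fold cross-validation.
scores = cross_val_score(pipeline, X, y, cv=5)
mean_score = scores.mean()
print(mean_score)
```

Each call refits the whole pipeline once per fold, and an optimization run repeats this for many candidate pipelines, which is why this step is the one worth accelerating.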
@terrytangyuan I think tf.contrib.learn.BaseEstimator is worth looking into: https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/contrib/learn/python/learn/estimators/estimator.py
I agree with dropping multiprocessing in TPOT and using TensorFlow, but many scikit-learn estimators use the multiprocessing backend in joblib, especially the ensemble methods (check this example). That is the computationally expensive part of some operators. So applying parallel computing only at the TPOT level may not speed things up much, since TPOT still needs to evaluate these scikit-learn pipelines. I think the ideal way is to wrap some scikit-learn operators with TensorFlow, like this SVM, which can speed things up on scikit-learn's end.
@weixuanfu2016 Yes, those are exactly the canned estimators I was referring to, which have an sklearn-like interface. They will be moved to TensorFlow core soon, but it will take some time. The only stable ones can be found here. For each estimator, you can set a configuration that contains all the distributed logic, e.g. master, workers, parameter servers. However, this might not be what @rhiever was talking about. I need to look into the code a little bit more.
@terrytangyuan Thank you for this information. These TF.learn estimators look very promising!
@terrytangyuan: Do I understand right that, to support the models, transformers, etc. in scikit-learn, we would need to rewrite all of the algorithms in a tflearn-like format?
@rhiever No, I don't think that's feasible. I am thinking it would just be good to support existing TF Estimators (other people from Google and the community will open-source more estimators) in a similar fashion to supporting sklearn estimators. All the distributed logic lives inside the TF Estimators rather than in TPOT, since it is hard to implement and maintain. TPOT would focus on providing genetic programming for both sklearn and TF Estimators.
I'm still just getting going with TF and fairly quickly ran into the issue of parameter tuning and model selection (activation, learning rate, regularization and the like). I started looking for patterns to follow, and came here hoping... :-) I'm not smart enough to grapple with it, but would be happy to try and help once some direction is established. It does seem that the TF Estimator class is getting the main features and love from Google. I'm currently working with TF 1.3 (and with GPU).
Unfortunately I haven't invested time in this yet, since I don't think it's ready. But again, TPOT needs to remove multiprocessing support if it hasn't done so already. I will try to revisit this once the Experiment API in TensorFlow has landed in core (it is currently in contrib.learn.estimators). The LearnRunner and the tune() method will make parameter tuning easier.
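To make the "support TF Estimators in a similar fashion to sklearn estimators" idea concrete, here is a rough sketch of a thin adapter that gives a train/predict(input_fn) style object an sklearn-like fit/predict surface. Everything in it is an assumption for illustration: EstimatorAdapter and MajorityClassEstimator are hypothetical names, and the stand-in estimator only imitates the general shape of an input_fn-based interface. It is not a TensorFlow class and not TPOT code.

```python
from collections import Counter


class EstimatorAdapter:
    """Hypothetical adapter: fit/predict on top of train/predict(input_fn)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # The wrapped estimator pulls its data by calling input_fn().
        self.estimator.train(input_fn=lambda: (X, y))
        return self

    def predict(self, X):
        # The wrapped estimator yields predictions lazily; collect them in a list.
        return list(self.estimator.predict(input_fn=lambda: (X, None)))


class MajorityClassEstimator:
    """Hypothetical stand-in (NOT a TensorFlow class): predicts the most common label."""

    def __init__(self):
        self.label_ = None

    def train(self, input_fn):
        _features, labels = input_fn()
        self.label_ = Counter(labels).most_common(1)[0][0]

    def predict(self, input_fn):
        features, _labels = input_fn()
        for _row in features:
            yield self.label_


model = EstimatorAdapter(MajorityClassEstimator()).fit([[0], [1], [2]], ["a", "b", "b"])
predictions = model.predict([[3], [4]])
print(predictions)
```

A wrapper meant to go through sklearn's cross_val_score would also need get_params and set_params so that sklearn can clone it; that part is left out of the sketch.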
Have a look at the H2O4GPU project here:
https://github.com/h2oai/h2o4gpu
It provides a "drop-in replacement for sklearn".
I was able to get it working in TPOT, but ran into the problem that when you replace some optimizers/estimators with GPU versions, the total run-time of TPOT becomes dominated by:
The way to address 1 is to minimize non-accelerated computation by replacing more components with accelerated versions and also removing non-accelerated candidates from the config.
The way to address 2 is to somehow know when to use GPU and when to use CPU based on each algorithm and dataset.
@gb96 Thank you for this finding. I like the first solution of using a GPU-accelerated config. It can be an optional config for users with GPU resources.
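As a rough sketch of what deriving such an optional config could look like, assuming the config is a dict that maps operator import paths to hyperparameter choices: the "gpu_lib." names below are placeholders rather than real H2O4GPU import paths, and keep_accelerated is a hypothetical helper, not part of TPOT.

```python
# Hypothetical config: operator import path -> hyperparameter choices.
full_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.neighbors.KNeighborsClassifier": {"n_neighbors": [1, 5, 25]},
    "gpu_lib.RandomForestClassifier": {"n_estimators": [100]},  # placeholder name
    "gpu_lib.LogisticRegression": {"C": [0.1, 1.0, 10.0]},  # placeholder name
}

# Module prefixes treated as GPU-accelerated (an assumption for this sketch).
ACCELERATED_PREFIXES = ("gpu_lib.",)


def keep_accelerated(config, prefixes=ACCELERATED_PREFIXES):
    """Return a new config containing only operators from accelerated modules."""
    return {
        name: params
        for name, params in config.items()
        if name.startswith(prefixes)
    }


gpu_config = keep_accelerated(full_config)
print(sorted(gpu_config))
```

Because the filter returns a new dict, the full config stays available as the default for users without GPU resources.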