Tpot: TimeSeriesSplit

Created on 27 Jan 2017 · 16 Comments · Source: EpistasisLab/tpot

Hi guys, would it be possible to use TimeSeriesSplit and the R2 score with TPOTRegressor?

question

rspadim

Most helpful comment

Yes. Pass the TimeSeriesSplit object to the num_cv_folds parameter of TPOTRegressor. Even though the parameter is called num_cv_folds, it is passed directly to the internal calls to cross_val_score as the cv parameter, like this:

cross_val_score(features, labels, ..., cv=self.num_cv_folds)

rhiever on 27 Jan 2017

👍3
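
A minimal sketch of what that internal call amounts to, using scikit-learn only. The Ridge estimator and the synthetic data are stand-ins (assumptions for illustration, not what TPOT actually evaluates), and the commented-out TPOT line simply follows the parameter name mentioned in this thread:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered regression data (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The splitter object is passed as-is through the cv argument.
tscv = TimeSeriesSplit(n_splits=5)

# Ridge stands in for whichever pipeline is being scored.
scores = cross_val_score(Ridge(), X, y, cv=tscv, scoring="r2")
print(len(scores))  # one R2 score per split

# Hypothetical TPOT usage, assuming the parameter name from this thread
# and that the "r2" scoring string is accepted by your TPOT version:
# tpot = TPOTRegressor(num_cv_folds=tscv, scoring="r2")
```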

All 16 comments

Hmm, nice! :)
But how about the fit() and predict() functions? Should I pass the full DataFrame, or slice it into X_train/X_test/y_train/y_test?
I'm trying the default example:

tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)

It would be nice to include this in the docs :) I could help (tell me how and I'll gladly do it :) )

rspadim on 28 Jan 2017

Hmm... I was testing fit() with the full DataFrame (8+ GB) and got an error (my computer has 16 GB of RAM + 10 GB of swap, running Linux), and all python3 processes halted (checked with htop).

I pressed Ctrl+C and (I don't know why the process wasn't killed) the process continued (maybe just one thread was cancelled by Ctrl+C?). Terminal output:

TPOTRegressor(crossover_rate=0.05, disable_update_check=False, generations=5,
max_eval_time_mins=20, max_time_mins=None, mutation_rate=0.9,
num_cv_folds=TimeSeriesSplit(n_splits=20), population_size=20,
random_state=None, scoring=None, verbosity=4)
Optimization Progress: 0%| | 0/120 [00:00
Traceback (most recent call last):
OSError: [Errno 12] Cannot allocate memory

^CProcess ForkPoolWorker-28:
Traceback (most recent call last):
KeyboardInterrupt
Optimization Progress: 1%|▊ | 1/120 [12:17<24:21:57, 737.12s/pipeline]

Should I post this as an issue (bug)? I liked that ^C didn't kill my process, but when an "OSError 12" occurs the process stops and waits for the user to press Ctrl+C before continuing. Could we add an option like "on error 12, abort the current pipeline" (something more automatic)?

rspadim on 28 Jan 2017

You should pass numpy arrays to TPOT, just as you would pass data to any scikit-learn estimator. Here is an example that reads a data set into a pandas DataFrame, cleans it, then passes the data as numpy arrays to TPOT.

I also recommend performing a train/test split, as you described, so you can validate that the model TPOT builds does indeed generalize to unseen data. TPOT uses k-fold CV internally to determine pipeline scores, so the accuracy estimates on the training data should be close to the validation accuracy.

Another advantage of performing a scikit-learn train/test split is that it shuffles your data, which can be quite important for some data sets.

rhiever on 30 Jan 2017

👍1

Nice. I think I'm a bit confused: the train/test split is just there to hold out "unseen" data, it's not the train/test data used inside fit() or the CV, right?

rspadim on 31 Jan 2017

Yes. You perform the train/test split before providing the training data to TPOT. Another example of that can be seen here.

rhiever on 31 Jan 2017
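
The order of operations described above can be sketched as follows. This is an illustration under assumptions: Ridge stands in for TPOT, the data is synthetic, and shuffle=False is used only because the original question concerns time-ordered rows (leave it out to get the default shuffled split):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    TimeSeriesSplit,
    cross_val_score,
    train_test_split,
)

rng = np.random.RandomState(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=300)

# 1) Hold out unseen data first. shuffle=False keeps the row order,
#    so the test set is the last 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# 2) Cross-validation happens on the training portion only
#    (the role TPOT's internal CV plays during fit()).
cv_scores = cross_val_score(
    Ridge(), X_train, y_train, cv=TimeSeriesSplit(n_splits=5), scoring="r2"
)

# 3) Fit on all training rows, then score once on the held-out rows.
model = Ridge().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(cv_scores.mean(), test_r2)
```

If the CV estimate and the held-out score are far apart, that is a hint the model does not generalize as well as the internal scores suggest.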

Hi, I'm having similar trouble - I'm trying to get TPOT to use GroupKFold (so crossvalidation splits don't break up data from single individuals). However, I can't set the groups variable for the fold object, so it seems to be using the same folds for every generation if I send it the split:

gkf = GroupKFold(n_splits=4) #4 individuals left
groupFoldSplit = gkf.split(X_train, y_train, groups=Groups_train)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, num_cv_folds=groupFoldSplit)
tpot.fit(X_train, y_train)

I'm guessing I should send gkf and not groupFoldSplit, but it wouldn't know about the grouping. Any ideas?

leonfrench on 18 Feb 2017

@leonfrench, looks like it's a known bug in scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/7646

It's something they will have to fix on their end. We directly use their cross_val_score interface.

rhiever on 23 Feb 2017
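
For reference, a sketch of how grouping works when calling scikit-learn directly. It shows that a generator returned by split() is exhausted after one pass, that cross_val_score accepts a groups argument, and that a materialized list of splits can be reused. The data and LogisticRegression are stand-ins; whether a given TPOT version accepts such a list through num_cv_folds/cv is an assumption to verify, not something confirmed in this thread:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(80, 3))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(4), 20)  # 4 individuals, 20 rows each

gkf = GroupKFold(n_splits=4)

# A generator from split() can be consumed only once.
gen = gkf.split(X, y, groups=groups)
first_pass = list(gen)   # 4 (train, test) index pairs
second_pass = list(gen)  # nothing left to iterate over

clf = LogisticRegression()

# Option A: pass the splitter and let cross_val_score forward the groups.
scores_groups = cross_val_score(clf, X, y, groups=groups, cv=gkf)

# Option B: materialize the splits into a list, which can be reused.
splits = list(gkf.split(X, y, groups=groups))
scores_list_1 = cross_val_score(clf, X, y, cv=splits)
scores_list_2 = cross_val_score(clf, X, y, cv=splits)
```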

We're closing all questions that haven't been updated in a while. Please feel free to re-open this issue if your issue persists.

rhiever on 21 Mar 2017

Hi guys,
I tried to create a TimeSeriesSplit object from sklearn.model_selection and pass it to TPOTClassifier as the cv argument. It seems like the internal CV is still K-fold with random shuffling.
I'm dealing with a time series classification problem, so this might create look-ahead bias.
Is there a way to change TPOT's internal CV method?

TPOTRegressor has "num_cv_folds", but TPOTClassifier does not.

Here is my code:
tscv = TimeSeriesSplit(n_splits=10)
tpot = TPOTClassifier(generations=4, population_size=40, verbosity=2, cv=tscv)

simonzcaiman on 24 May 2017

Hi @simonzcaiman, can you please explain why you think TPOT is still using K-fold with random shuffling from this example? We pass the object given to the cv parameter directly to the cross_val_score function, so it should work just the same.

rhiever on 24 May 2017
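
One way to check a splitter for look-ahead bias before handing it over is to inspect the indices it yields. A small sketch with scikit-learn only; has_look_ahead is a hypothetical helper written for this illustration, not part of TPOT or scikit-learn:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # the row index doubles as a timestamp


def has_look_ahead(splitter, X):
    """True if any split trains on a row later than its earliest test row."""
    return any(train.max() > test.min() for train, test in splitter.split(X))


ts_leaks = has_look_ahead(TimeSeriesSplit(n_splits=3), X)
kf_leaks = has_look_ahead(KFold(n_splits=4, shuffle=True, random_state=0), X)
print(ts_leaks, kf_leaks)  # False True
```

TimeSeriesSplit always places the training rows before the test rows, so if look-ahead shows up in practice it is worth checking that the splitter object is really the one reaching the cv argument.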

Hi,

First of all, thanks for the awesome project. I just got into it two days ago and I'm loving it a lot.
The problem was solved once I restarted my Python terminal. I took a look at the source code and I see that the cv input argument is passed directly to sklearn's cross_val_score function.
No problem with this. I'd just like to suggest listing these possible cv objects in the documentation in the base.py script.

simonzcaiman on 24 May 2017

Funny you say that---we just added that to the docs on the dev branch, so it will be out in the next release. :-)

Glad the issue was sorted out.

rhiever on 25 May 2017

Wow. That's great!
One more suggestion, about a minor typo in the documentation in "base.py":
In the available options for the "scoring" argument, there is an option listed as 'balanced accuracy', where it should really be 'balanced_accuracy'.
Thanks again for the great work here :)

simonzcaiman on 25 May 2017

👍1

Thank you for the suggestion. It is already fixed in the dev branch on this line.