Hi guys, is it possible to use TimeSeriesSplit and an R2 score with TPOTRegressor?
Yes. Pass the TimeSeriesSplit object to the num_cv_folds parameter of TPOTRegressor. Even though the parameter is called num_cv_folds, it is passed directly to the internal calls to cross_val_score as the cv parameter, like this:
cross_val_score(features, labels, ..., cv=self.num_cv_folds)
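To illustrate what that cv argument does once it reaches cross_val_score, here is a minimal sketch that uses scikit-learn only. The synthetic data and the LinearRegression estimator are assumptions made for the example; they are not taken from TPOT. In the TPOT case, the pipeline under evaluation would take the estimator's place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data (illustrative only).
rng = np.random.RandomState(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=100)

tscv = TimeSeriesSplit(n_splits=5)

# In every fold, all training indices come before all test indices,
# so the model is never trained on "future" samples.
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()

# The splitter object is passed as cv; scoring="r2" selects the R2 metric.
scores = cross_val_score(LinearRegression(), X, y, cv=tscv, scoring="r2")
print(scores)
```

The same splitter object is what you would hand to TPOT through the parameter named above.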
Hmm, nice! :)
But what about the fit() and predict() functions? Should I pass the full DataFrame, or slice it into X_train/X_test/y_train/y_test?
I'm trying the default example:
tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)
It would be nice to include this in the docs :) I could help (tell me how and I'll gladly help :) )
Hmm... I was testing fit with the full DataFrame (8+ GB) and got an error (my computer has 16 GB of RAM + 10 GB of swap, running Linux), and all python3 processes halted (checked with htop).
I pressed Ctrl+C and (I don't know why the process wasn't killed) the process continued (maybe just one thread was canceled by Ctrl+C?). Terminal output:
TPOTRegressor(crossover_rate=0.05, disable_update_check=False, generations=5,
max_eval_time_mins=20, max_time_mins=None, mutation_rate=0.9,
num_cv_folds=TimeSeriesSplit(n_splits=20), population_size=20,
random_state=None, scoring=None, verbosity=4)
Optimization Progress: 0%| | 0/120 [00:00, ?pipeline/s]Exception in thread Thread-41:
Traceback (most recent call last):
OSError: [Errno 12] Cannot allocate memory
^CProcess ForkPoolWorker-28:
Traceback (most recent call last):
KeyboardInterrupt
Optimization Progress: 1%|▊ | 1/120 [12:17<24:21:57, 737.12s/pipeline]
Should I post this as an issue (bug)? I like that ^C didn't kill my process, but when an "OSError 12" occurs the process stops and waits for the user to press Ctrl+C before continuing. Could we include an option to abort the current pipeline on error 12 (something more automatic)?
You should pass numpy arrays to TPOT, just as you would pass data to any scikit-learn estimator. Here is an example that reads a data set into a pandas DataFrame, cleans it, then passes the data as numpy arrays to TPOT.
I also recommend performing a train/test split, as you described, so you can validate that the model TPOT builds does indeed generalize to unseen data. TPOT uses k-fold CV internally to determine pipeline scores, so the accuracy estimates on the training data should be close to the validation accuracy.
Another advantage of performing a scikit-learn train/test split is that it shuffles your data, which can be quite important for some data sets.
Nice. I think I'm a bit confused: the train/test split just sets aside "unseen" data, and it's not the train/test data used inside fit() or CV, right?
Yes. You perform the train/test split before providing the training data to TPOT. Another example of that can be seen here.
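As a rough sketch of that workflow for time-ordered data, the example below holds out the final part of the series first, cross-validates on the training part only, and scores on the held-out part at the end. The synthetic data and the Ridge estimator are assumptions for illustration, with Ridge standing in for a TPOT object. Passing shuffle=False to train_test_split is a choice made here to keep chronological order; the default behaviour shuffles, as mentioned above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    TimeSeriesSplit,
    cross_val_score,
    train_test_split,
)

# Synthetic, time-ordered data (illustrative only).
rng = np.random.RandomState(42)
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=200)

# Hold out the last 25% as unseen data; shuffle=False preserves time order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Cross-validation happens on the training portion only.
cv_scores = cross_val_score(
    Ridge(), X_train, y_train, cv=TimeSeriesSplit(n_splits=5), scoring="r2"
)

# Final check on the held-out data, analogous to tpot.score(X_test, y_test).
model = Ridge().fit(X_train, y_train)
holdout_r2 = model.score(X_test, y_test)
print(cv_scores, holdout_r2)
```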
Hi, I'm having similar trouble - I'm trying to get TPOT to use GroupKFold (so crossvalidation splits don't break up data from single individuals). However, I can't set the groups variable for the fold object, so it seems to be using the same folds for every generation if I send it the split:
gkf = GroupKFold(n_splits=4) #4 individuals left
groupFoldSplit = gkf.split(X_train, y_train, groups=Groups_train)
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, num_cv_folds=groupFoldSplit)
tpot.fit(X_train, y_train)
I'm guessing I should send gkf and not groupFoldSplit, but it wouldn't know about the grouping. Any ideas?
@leonfrench, looks like it's a known bug in scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/7646
It's something they will have to fix on their end. We directly use their cross_val_score interface.
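One possible workaround, sketched here with scikit-learn only, relies on the fact that the object returned by gkf.split(...) is a generator and can be iterated only once. Materializing it as a list gives a reusable set of (train, test) index pairs, which cross_val_score accepts as cv. Whether TPOT accepts such a list depends on it forwarding the argument unchanged, as described earlier in this thread; that is an assumption and has not been verified here. The data, the group layout, and the LogisticRegression estimator are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic data: 4 individuals with 10 samples each (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(4), 10)

gkf = GroupKFold(n_splits=4)

# list(...) turns the one-shot generator into reusable index pairs.
splits = list(gkf.split(X, y, groups=groups))

# No individual appears on both sides of any fold.
for train_idx, test_idx in splits:
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

# The same list can be passed as cv more than once.
scores_a = cross_val_score(LogisticRegression(), X, y, cv=splits)
scores_b = cross_val_score(LogisticRegression(), X, y, cv=splits)
print(scores_a)
```

Note that the folds are fixed when the list is built, so every evaluation sees the same grouping.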
We're closing all questions that haven't been updated in a while. Please feel free to re-open this issue if your issue persists.
Hi guys,
I tried creating a TimeSeriesSplit object from sklearn.model_selection and passing it to TPOTClassifier as the cv argument. It seems the internal CV is still K-fold with random shuffling.
I'm dealing with a time series classification problem, so this might introduce look-ahead bias.
Is there a way to change TPOT's internal CV method?
TPOTRegressor has "num_cv_folds", but TPOTClassifier does not.
Here is my code:
tscv = TimeSeriesSplit(n_splits=10)
tpot = TPOTClassifier(generations=4, population_size=40, verbosity=2, cv=tscv)
Hi @simonzcaiman, can you please explain why you think TPOT is still using K-fold with random shuffling from this example? We pass the object given to the cv parameter directly to the cross_val_score function, so it should work just the same.
Hi,
First of all, thanks for the awesome project. I just got into it two days ago and I'm loving it.
The problem went away after I restarted my Python terminal. I took a look at the source code and saw that the cv argument is passed directly to sklearn's cross_val_score function.
No problem there. I would just like to suggest documenting these possible cv objects in base.py.
Funny you say that---we just added that to the docs on the dev branch, so it will be out in the next release. :-)
Glad the issue was sorted out.
Wow. That's great!
One more suggestion, about a minor typo in the documentation in "base.py":
Among the available options for the "scoring" argument there is 'balanced accuracy', which should really be 'balanced_accuracy'.
Thanks again for the great work here :)
Thank you for the suggestion. It is already fixed in the dev branch on this line
Thanks!