Tpot: TPOT is not performing well for Timeseries Data

Created on 20 Jun 2018 · 17 Comments · Source: EpistasisLab/tpot

I am pulling data from the Quandl API and using the stock data to predict future stock prices. The model is a linear model and fits the training data well, but it produces a straight line for future data. I converted the dates to a standard int format before processing, so I don't know what I am doing wrong :( Any help is appreciated, such as code examples.

Process to reproduce the issue

  1. Pull Quandl Data
  2. Convert Dates to numeric format
  3. Use the TPOT regressor to fit
  4. Try predicting future data points and plotting them, but receive a linear output :(

All 17 comments

@rhiever

Are you performing cross-validation correctly? e.g. using TimeSeriesSplit?

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, cv=TimeSeriesSplit(n_splits=15))

This is the way I am calling the regressor, and then I am using tpot.fitted_pipeline_ for making predictions...
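To make the cross-validation question concrete, here is a minimal sketch of what scikit-learn's TimeSeriesSplit does with time-ordered samples. The 20-sample array and the choice of 4 splits are made up for illustration and are not taken from the thread.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 time-ordered samples (illustrative data, not the Quandl series).
X = np.arange(20).reshape(-1, 1)

# Each fold trains on an initial stretch of the series and tests on the
# block that immediately follows it, so no future sample leaks into training.
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```

Because every test block lies strictly after its training block, the CV score reflects forecasting rather than interpolation.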

How many samples do you have in the time series? How many features?

My next suggestion is to give TPOT more time to explore more pipelines. That means increasing the generations and population_size parameters. I recommend starting with 100 for each, and giving TPOT plenty of time (and patience :-) ) to work.

[3721 rows x 2 columns]; 1 column for dates the other one is target......

Gotcha. Is there a pattern based on the date in the time series? Otherwise you'll likely need more features to build an effective predictive model.

[Image attachment: "download"]

The model fits the training data, but on unseen data the prediction is a straight line.

Are there any features other than the date in your input features (X)?

I am afraid not. How do I incorporate more features into stock data?

Stock price can't be predicted from the date alone. I think you need to merge more features (if there are any from that API) with the date.
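One common way to get features beyond the date, when the API offers nothing else, is to use lagged values of the price itself. The sketch below is only an illustration of that idea. The helper name make_lag_features, the sample prices, and the choice of two lags are all hypothetical and do not come from the thread or from TPOT.

```python
def make_lag_features(prices, n_lags):
    """Pair each price with the n_lags prices that came right before it."""
    X, y = [], []
    for t in range(n_lags, len(prices)):
        X.append(prices[t - n_lags:t])  # prices at t-n_lags ... t-1
        y.append(prices[t])             # price to predict at time t
    return X, y


# Hypothetical closing prices in chronological order.
prices = [10.0, 11.0, 12.0, 11.5, 13.0]
X, y = make_lag_features(prices, n_lags=2)
```

The resulting X and y could then be passed to a regressor in place of the date-only input. Whether lagged prices carry any predictive signal is a separate question.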

FYI, that looks like a massively overfit model. It probably outputs a flat line on new data because that is the last value it learned to predict at the final time point. You definitely need more features to predict stock price here, but that is outside the purview of TPOT support.

I suggest Googling "python stock price prediction" and there will be dozens of articles covering the topic, including how to integrate additional features into the predictive model.
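The flat-line behaviour described above can be reproduced in a small sketch, assuming the fitted pipeline ends in a tree-based regressor. That is an assumption: the thread does not say which pipeline TPOT selected. The synthetic series below stands in for the stock data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic "price" series indexed by an integer time step (illustrative only).
t = np.arange(100, dtype=float)
y = 0.5 * t + 10.0 * np.sin(t / 5.0)

# A fully grown tree on a single time feature memorizes the training data.
model = DecisionTreeRegressor(random_state=0).fit(t.reshape(-1, 1), y)

# Every future time step is larger than any split threshold, so all of them
# fall into the same right-most leaf and receive the same prediction.
future = np.arange(100, 130, dtype=float).reshape(-1, 1)
pred = model.predict(future)
```

Here pred is constant and equal to the last training value, which is the flat line seen on unseen dates.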

Thanks! @rhiever @weixuanfu

  1. Can we use time series with TPOT, and what transformations can we apply to the datetime column?
  2. For categorical data, should we convert it to numerical values ourselves, or does TPOT do that for us?
  3. For missing values in numerical columns, do we need to handle them ourselves, or does TPOT do that transformation for us?

Thank you to anyone who can help with one or all of these questions. I'm working on my graduation project and I need these answers.

How did you convert your date column to a numerical type, and what type was it before the conversion?

1. Can we use time series with TPOT, and what transformations can we apply to the datetime column?

The input dataset should be sorted by the datetime column.
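As a minimal sketch of that step, the snippet below sorts rows chronologically and converts each date to an integer using only the Python standard library. The rows are hypothetical, and date.toordinal() is just one possible integer encoding. The thread does not say which conversion the original poster used.

```python
from datetime import date

# Hypothetical rows: (ISO date string, closing price), deliberately out of order.
rows = [("2018-06-20", 101.5), ("2018-06-18", 100.0), ("2018-06-19", 100.7)]

# Parse the strings into date objects and sort chronologically.
parsed = sorted((date.fromisoformat(d), price) for d, price in rows)

# toordinal() maps each date to an integer day count, so gaps such as
# weekends show up as larger differences between consecutive values.
X = [[d.toordinal()] for d, _ in parsed]
y = [price for _, price in parsed]
```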

2. For categorical data, should we convert it to numerical values ourselves, or does TPOT do that for us?

Yes, categorical features should be converted to numerical values. You may try OrdinalEncoder or OneHotEncoder in scikit-learn.
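A short sketch of the two encoders on a made-up single-column categorical feature follows. The colour values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical column with three distinct values (illustrative data).
X_cat = np.array([["red"], ["green"], ["blue"], ["green"]])

# OrdinalEncoder assigns one integer code per category (sorted order),
# which implies an ordering that may not be meaningful.
ordinal = OrdinalEncoder().fit_transform(X_cat)

# OneHotEncoder creates one 0/1 column per category; the default output
# is a sparse matrix, so convert it to a dense array for inspection.
one_hot = OneHotEncoder().fit_transform(X_cat).toarray()
```

Ordinal codes keep the feature in a single column, while one-hot encoding avoids implying an order at the cost of extra columns.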

3. For missing values in numerical columns, do we need to handle them ourselves, or does TPOT do that transformation for us?

TPOT should use SimpleImputer from scikit-learn to impute the missing values.
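For anyone who prefers to impute explicitly before fitting, here is a minimal sketch of SimpleImputer on a small made-up array. The median strategy is chosen for illustration and is not a statement about TPOT's internal settings.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two numeric columns with one missing value each (illustrative data).
X = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])

# Replace each NaN with the median of the observed values in its column.
X_filled = SimpleImputer(strategy="median").fit_transform(X)
```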

Thank you so much for your reply.
For the time series, though, I would set this column as the index, sort it, and convert it to int so I can use it as an input for TPOT.
