Catboost: Catboost Regression. Function Extrapolation

Created on 7 Jan 2019 · 4 Comments · Source: catboost/catboost

Problem:
I'm new to ML and have a problem with CatBoost. I want to predict the value of a function (for example cos, sin, etc.). I've tried everything I could think of, but my prediction is always a straight line.

Is this possible, and if so, how can I fix my problem?

I also tried a polynomial function and the result is the same (identical values). Can you help me with the features? As I understand it, I need several columns with x raised to different powers, but the sin function is an infinite series of terms.
I'd be glad of any comment ))

catboost version: 0.12.0
Operating System: Windows 10
CPU: AMD A9-9420 RADEON R5, 5 COMPUTE CORES 2C + 3G 3.00 GHz

gushinyakov

Most helpful comment

You need to drop the 2nd column of X when you train, i.e.

model.fit(X_train[:,0], y_train)
preds = model.predict(X_test[:,0])

What's happening is that you're giving CatBoost two features to fit with. It's basically choosing one or the other at random, but of the two features, one generalises between train and test and the other doesn't.
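One way to see which column fails to generalise is to compare each feature's value range in train and test. The sketch below rebuilds the two columns from the question as plain Python lists; the helper names `column_range` and `fraction_inside` are made up for illustration and are not part of CatBoost.

```python
# Rebuild the two feature columns from the question with plain lists.
train_col0 = list(range(1, 365))         # day of year, 1..364
train_col1 = list(range(1, 365))         # running day counter, identical in training

test_col0 = list(range(1, 365)) * 2      # day of year repeats: 1..364, 1..364
test_col1 = list(range(365, 1093))       # running counter keeps growing: 365..1092


def column_range(column):
    """Return (min, max) of a feature column (hypothetical helper)."""
    return min(column), max(column)


def fraction_inside(test_column, train_column):
    """Share of test values that fall inside the training range."""
    low, high = column_range(train_column)
    inside = sum(1 for value in test_column if low <= value <= high)
    return inside / len(test_column)


# Column 0: every test value was seen in training.
print(fraction_inside(test_col0, train_col0))
# Column 1: no test value was seen in training.
print(fraction_inside(test_col1, train_col1))
```

Every test value in the second column is larger than anything seen in training, so any split on that column sends all test rows to the same side and the tree cannot tell them apart along that feature.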

This is where you want a 'pipeline' which transforms your original features into new features - i.e. a simple function which you pass your X's through before calling fit or predict.

Interestingly, it's a problem I'm currently studying, as I have the same issue: electricity demand is driven by temperature, and temperature is seasonal. But if I give the model temperature plus time of year as a sin/cos expansion, it gives the latter a higher importance than temperature (in reality, time of year is a proxy for other climatic variables not in the model). And because I don't have enough data, the training error is good but the test error is really bad.

david-waterworth on 9 Jan 2019

👍2

All 4 comments

I think the issue you're having is that regression trees cannot extrapolate. They'll only ever return results that fall (roughly) within the range of the target values seen during training.
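To make that concrete, here is a minimal sketch of a single 1-D regression tree written from scratch. It is only a stand-in for illustration and is not how CatBoost builds its trees; `fit_tree` and `predict_tree` are hypothetical names. The point is that every input beyond the largest training value ends up in the same leaf.

```python
import math


def fit_tree(xs, ys, depth):
    """Fit a tiny 1-D regression tree on sorted, distinct xs.

    Returns a float (leaf mean) or a tuple (threshold, left_subtree, right_subtree).
    """
    if depth == 0 or len(xs) < 2:
        return sum(ys) / len(ys)
    best_sse, best_i = None, None
    for i in range(1, len(xs)):  # try every split between neighbouring points
        left, right = ys[:i], ys[i:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    threshold = (xs[best_i - 1] + xs[best_i]) / 2
    return (threshold,
            fit_tree(xs[:best_i], ys[:best_i], depth - 1),
            fit_tree(xs[best_i:], ys[best_i:], depth - 1))


def predict_tree(tree, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


T = 365
train_t = list(range(1, 365))                               # days 1..364
train_y = [math.sin(2 * math.pi * t / T) for t in train_t]
tree = fit_tree(train_t, train_y, depth=4)

# 364 is the last training day; 500 and 1000 fall into the same leaf as 364.
for t in (364, 500, 1000):
    print(t, predict_tree(tree, t))
```

All thresholds lie between training points, so anything above the last training day always takes the right-hand branch and gets the same constant prediction.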

So if you try and fit a model to a function

y = sin(2*pi*t/T)

the prediction for values of t outside the range you trained on will stay close to the prediction at the highest / lowest t you trained with.

If instead you use the angle version of sin and train with angles from 0 to 2*pi radians, then you'll be able to approximate the full range of the function.

So say you're trying to fit a model with annual seasonality. You need a feature which is the position in the year, i.e. t = 1 to 12 for months (T=12), t = 1 to 365 for days (T=365), etc. This repeats each year rather than continuously increasing.
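A minimal sketch of such a repeating feature, assuming a 1-based running counter t and a fixed period T (leap years are ignored); the function names are hypothetical:

```python
import math


def position_in_period(t, period):
    """Map a 1-based running counter onto 1..period, wrapping around."""
    return (t - 1) % period + 1


def phase_angle(t, period):
    """Angle in [0, 2*pi) whose sine equals sin(2*pi*t/period)."""
    return 2 * math.pi * (t % period) / period


# Day 366 of a running counter is day 1 of the next year.
print(position_in_period(366, 365))
# Month 13 is month 1 of the next year.
print(position_in_period(13, 12))
# Both angles describe the same point in the cycle.
print(phase_angle(35, 365), phase_angle(400, 365))
```

Either feature stays inside the range seen during training no matter how far t runs, which is what lets a tree-based model reuse what it learned.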

Normally you'd create a series of sin / cos terms of increasing frequency.
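As a rough sketch of what such a series could look like: one sin and one cos column per harmonic k = 1, 2, ..., each completing k cycles per period. The function name and the `n_harmonics` parameter are assumptions for illustration; how many harmonics to use depends on the data.

```python
import math


def fourier_features(t, period, n_harmonics=3):
    """Return [sin(1*w*t), cos(1*w*t), sin(2*w*t), cos(2*w*t), ...] with w = 2*pi/period."""
    row = []
    for k in range(1, n_harmonics + 1):
        angle = 2 * math.pi * k * t / period
        row.append(math.sin(angle))
        row.append(math.cos(angle))
    return row


# One feature row per day; rows one full period apart are (numerically) the same.
row_a = fourier_features(10, 365)
row_b = fourier_features(10 + 365, 365)
print(len(row_a))
print(max(abs(a - b) for a, b in zip(row_a, row_b)))
```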

david-waterworth on 8 Jan 2019

👍2

David, thanks a lot!

But now I have a question: why are the min and max values of my predictions only about -0.6 and 0.2?

import numpy as np
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor

# Two identical feature columns for training: days 1..364
X_train = np.array([np.arange(1, 365), np.arange(1, 365)])
y_train = np.sin(2 * np.pi * X_train[0] / 365)

# Test column 0 repeats days 1..364 twice; column 1 keeps counting from 365 to 1092
X_period = np.concatenate([np.arange(1, 365), np.arange(1, 365)])
X_test = np.array([X_period, np.arange(365, 1093)])

# Switch to shape (n_samples, n_features)
X_train = X_train.transpose()
X_test = X_test.transpose()

param_iterations = 1000
param_learning_rate = 0.01
param_depth = 2
param_title = 'iterations={0} | learning_rate={1} | depth={2}'.format(
    param_iterations, param_learning_rate, param_depth)

model = CatBoostRegressor(
    iterations=param_iterations,
    learning_rate=param_learning_rate,
    depth=param_depth,
    verbose=False
)
model.fit(X_train, y_train)

preds = model.predict(X_test)

plt.title(param_title)
plt.plot(preds)
plt.show()

gushinyakov on 9 Jan 2019

You need to drop the 2nd column of X when you train, i.e.

model.fit(X_train[:,0], y_train)
preds = model.predict(X_test[:,0])

This is where you want a 'pipeline' which transforms your original features into new features - i.e. a simple function which you pass your X's through before calling fit or predict.
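A minimal sketch of that idea, assuming the only transformation needed is mapping a running day counter to its position in the year. `PeriodicFeatureModel` and `LookupModel` are hypothetical names; the lookup model is a stand-in used here so the example runs without CatBoost, and in practice any object with `fit` and `predict` methods (such as the `CatBoostRegressor` from the question) would presumably take its place.

```python
import math


class PeriodicFeatureModel:
    """Wrap a model so the same feature transform runs before fit and predict."""

    def __init__(self, model, period):
        self.model = model
        self.period = period

    def transform(self, day_counters):
        # One feature per row: the 1-based position inside the period.
        return [[(t - 1) % self.period + 1] for t in day_counters]

    def fit(self, day_counters, y):
        self.model.fit(self.transform(day_counters), y)
        return self

    def predict(self, day_counters):
        return self.model.predict(self.transform(day_counters))


class LookupModel:
    """Stand-in model: remembers the target seen for each feature row."""

    def fit(self, X, y):
        self.table = {tuple(row): target for row, target in zip(X, y)}

    def predict(self, X):
        return [self.table[tuple(row)] for row in X]


period = 365
train_days = list(range(1, 366))                                  # one full year
train_y = [math.sin(2 * math.pi * t / period) for t in train_days]

wrapped = PeriodicFeatureModel(LookupModel(), period).fit(train_days, train_y)

# Days far beyond the training range are mapped back into 1..365 first.
print(wrapped.transform([366, 730, 1000]))
print(wrapped.predict([366, 730, 1000]))
```

Keeping the transform inside the wrapper means train and test data cannot drift apart in how their features are built.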

david-waterworth on 9 Jan 2019

👍2

David, thank you so much, it works))

gushinyakov on 9 Jan 2019
