Problem:
I'm new to ML and have a problem with catboost. I want to predict the value of a function (for example cos, sin, etc.). I've tried everything I could think of, but my prediction is always a straight line.
Is this possible, and if so, how can I fix my problem?
I tried a polynomial function and the result is the same (identical values). Can you help me with features? As I understand it, I need several columns with x raised to different powers, but the sin function is an infinite sum of terms.
I'd be glad of any comment ))
catboost version: 0.12.0
Operating System: Windows 10
CPU: AMD A9-9420 RADEON R5, 5 COMPUTE CORES 2C + 3G 3.00 GHz
I think the issue you're having is that regression trees cannot extrapolate. They can only return values learned from within the range of the independent variable you fitted on.
So if you try to fit a model to a function
y = sin(2*pi*t/T)
the prediction for values of t outside the range you trained on will stay close to the prediction at the highest / lowest t you trained with.
If instead you use the angle version of sin and train with angles from 0 to 2*pi radians, then you'll be able to approximate the full range of the function.
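To make that concrete, here is a rough sketch using a hand-rolled piecewise-constant fit as a stand-in for a one-feature regression tree. It is not catboost; the helper names `fit_binned_means` and `predict_binned` and the choice of 26 bins are made up for illustration. Inside the training range it follows the sine, outside it just repeats the value of the last bin.

```python
import math

def fit_binned_means(ts, ys, n_bins):
    """Average y within equal-width bins of t (one constant per bin, like tree leaves)."""
    lo, hi = min(ts), max(ts)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, y in zip(ts, ys):
        i = min(int((t - lo) / width), n_bins - 1)
        sums[i] += y
        counts[i] += 1
    means = [s / c for s, c in zip(sums, counts)]
    return lo, width, means

def predict_binned(model, t):
    lo, width, means = model
    # Any t outside the training range is clamped into the first or last bin.
    i = int((t - lo) // width)
    i = max(0, min(i, len(means) - 1))
    return means[i]

T = 365
train_t = list(range(1, 365))
train_y = [math.sin(2 * math.pi * t / T) for t in train_t]
model = fit_binned_means(train_t, train_y, n_bins=26)

inside_pred = predict_binned(model, 91)  # near the peak, inside the training range
outside_preds = [predict_binned(model, t) for t in (400, 456, 700, 1000)]
print(inside_pred, outside_preds)
```

All four predictions past day 364 come out identical, even though the true function keeps oscillating there.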
So say you're trying to fit a model with annual seasonality. You need a feature which is the position in the year, i.e. t = 1-12 for months (T=12), t = 1-365 for days (T=365), etc. This repeats each year rather than continuously increasing.
Normally you'd create a series of sin / cos terms of increasing frequency.
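A minimal sketch of such a feature, assuming a 1-based running day index and ignoring leap years; `position_in_period` is just an illustrative name:

```python
def position_in_period(t, period):
    """Map a 1-based running index t onto 1..period, repeating every period."""
    return (t - 1) % period + 1

train_days = range(1, 366)     # first year: days 1..365
test_days = range(366, 1096)   # the following two years: days 366..1095

train_feature = [position_in_period(t, 365) for t in train_days]
test_feature = [position_in_period(t, 365) for t in test_days]

# The test feature stays inside the range seen in training.
print(min(test_feature), max(test_feature))
```

The same function works for months with `period=12`.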
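Something like the following is what I mean by sin / cos terms, as a sketch only. The function name `fourier_features` and the choice of three harmonics are arbitrary; how many you need depends on your data.

```python
import math

def fourier_features(t, period, n_harmonics):
    """Return [sin(w*t), cos(w*t), sin(2*w*t), cos(2*w*t), ...] with w = 2*pi/period."""
    features = []
    for k in range(1, n_harmonics + 1):
        angle = 2 * math.pi * k * t / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features

row = fourier_features(91, 365, 3)
same_day_next_year = fourier_features(91 + 365, 365, 3)
print(row)
print(same_day_next_year)
```

Because every term is periodic in T, the same day in a later year produces the same feature row (up to floating-point rounding), so the model never sees feature values outside its training range.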
David, thanks a lot!
But now I have a question: why are the min and max of my predictions only about -0.6 and 0.2?
import numpy as np
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor

# Two identical feature rows covering days 1..364 (transposed to columns below)
X_train = np.array([np.arange(1, 365, 1), np.arange(1, 365, 1)])
y_train = np.sin(2 * np.pi * X_train[0] / 365)
# Test features: day of year repeated twice, plus a running day index 365..1092
X_period = np.concatenate([
    np.arange(1, 365, 1),
    np.arange(1, 365, 1)
])
X_test = np.array([X_period, np.arange(365, 1093, 1)])
X_train = X_train.transpose()
X_test = X_test.transpose()
param_iterations = 1000
param_learning_rate = 0.01
param_depth = 2
param_title = 'iterations={0} | learning_rate={1} | depth={2}'.format(
    param_iterations, param_learning_rate, param_depth)
model = CatBoostRegressor(
    iterations=param_iterations,
    learning_rate=param_learning_rate,
    depth=param_depth,
    verbose=False
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
plt.title(param_title)
plt.plot(preds)
plt.show()
You need to drop the 2nd column of X when you train, i.e.
model.fit(X_train[:,0], y_train)
preds = model.predict(X_test[:,0])
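What's happening is that you're giving catboost two features to fit with. In training they are identical, so it's basically choosing one or the other at random, but only one of them generalises between train and test: the second column of your test set runs from 365 to 1092, entirely outside the range seen in training.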
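A quick way to see which feature generalises is to check how much of each test column falls inside the range seen in training. This is only a sketch on the arrays from the question, rebuilt here as plain lists; `fraction_inside_train_range` is a made-up helper.

```python
train_col1 = list(range(1, 365))       # day of year, 1..364
train_col2 = list(range(1, 365))       # identical copy in training
test_col1 = list(range(1, 365)) * 2    # day of year, repeated for two periods
test_col2 = list(range(365, 1093))     # running day index, 365..1092

def fraction_inside_train_range(train_values, test_values):
    """Share of test values lying within [min, max] of the training values."""
    lo, hi = min(train_values), max(train_values)
    inside = sum(lo <= v <= hi for v in test_values)
    return inside / len(test_values)

col1_coverage = fraction_inside_train_range(train_col1, test_col1)
col2_coverage = fraction_inside_train_range(train_col2, test_col2)
print(col1_coverage, col2_coverage)  # 1.0 0.0
```

Since every test value in the second column lies above the training maximum, any split learned on that column sends all test rows to the same side, so the splits on it contribute the same amount to every prediction.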
This is where you want a 'pipeline' which transforms your original features into new features, i.e. a simple function which you pass your X's through before calling fit or predict.
Interestingly it's a problem I'm currently studying, as I have the same issue: electricity demand is driven by temperature, and temperature is seasonal. But if I give the model temperature and time of year as a sin/cos expansion, it gives the latter a higher importance than temperature (in reality time of year is a proxy for other climatic variables not in the model). And I don't have enough data, so whilst the training error is good, the test error is really bad.
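As a rough illustration of that idea, here is a tiny wrapper that runs the same transform before both fit and predict. Everything in it is hypothetical: `FeaturePipeline`, `day_to_features` and the `LookupModel` stand-in are invented for this sketch. Any model exposing `fit` and `predict`, such as a `CatBoostRegressor`, should be able to take the stand-in's place.

```python
import math

class FeaturePipeline:
    """Apply the same feature transform before both fit and predict."""

    def __init__(self, transform, model):
        self.transform = transform
        self.model = model

    def fit(self, raw_X, y):
        self.model.fit([self.transform(x) for x in raw_X], y)
        return self

    def predict(self, raw_X):
        return self.model.predict([self.transform(x) for x in raw_X])

def day_to_features(t, period=365):
    """Turn a 1-based running day index into a one-element row: [day of year]."""
    return [(t - 1) % period + 1]

class LookupModel:
    """Hypothetical stand-in model: memorises the target for each feature row."""

    def fit(self, X, y):
        self.table = {tuple(row): target for row, target in zip(X, y)}

    def predict(self, X):
        return [self.table[tuple(row)] for row in X]

train_t = list(range(1, 366))
train_y = [math.sin(2 * math.pi * t / 365) for t in train_t]

pipeline = FeaturePipeline(day_to_features, LookupModel()).fit(train_t, train_y)
preds = pipeline.predict([456, 821])  # day 91 of the second and third year
print(preds)
```

The point of the wrapper is that you cannot forget to transform the test data: the raw day index goes in, and the model only ever sees the repeating feature.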
David, thank you so much, it works))