Howdy,
I get models with 0.97 test r squared, but when I test "accuracy" with the following formula, it's something like 50%.
in_range = (y >= p - avg_mae) & (y <= p + avg_mae)
accuracy = sum(in_range) / length(in_range)
I didn't see this in the docs, but are we meant to shuffle the data beforehand? Also, what specifically does subsample do? I know it randomly selects a subset, but does it ignore the rest?
I calculate the average test mean absolute error with 10-fold cross-validation, and the standard deviation of the error across folds is very low.
I am not sure I understand this question/issue. It seems that you used r2 as the scoring function in TPOT and got a good test r2 score, but the custom "accuracy" score is poor. Why not use the custom "accuracy" as the scoring function in TPOT? Check the TPOT docs on scoring functions.
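For reference, a minimal sketch of what such a custom metric could look like in Python. The function name `within_tolerance_score` and its `tolerance` argument (standing in for `avg_mae`) are made up for illustration and are not part of TPOT or scikit-learn. It assumes the tolerance is fixed ahead of time rather than recomputed per pipeline.

```python
def within_tolerance_score(y_true, y_pred, tolerance):
    """Fraction of targets that lie within +/- tolerance of the prediction."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("y_true and y_pred must not be empty")
    # Same test as: (y >= p - tolerance) & (y <= p + tolerance)
    hits = sum(1 for y, p in pairs if p - tolerance <= y <= p + tolerance)
    return hits / len(pairs)


# Tiny example: a band of +/- 0.5 around each prediction.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.2, 2.9, 3.0, 3.4]
score = within_tolerance_score(y_true, y_pred, tolerance=0.5)
print(score)  # 0.5 (the first and third targets fall inside the band)
```

Because the function takes `(y_true, y_pred)`, it should be possible to wrap it with scikit-learn's `make_scorer` (passing `tolerance` as a keyword argument) and hand the result to TPOT's `scoring` parameter. Check the TPOT docs for the exact form your version expects.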
I didn't see this in the docs, but are we meant to shuffle the data beforehand? Also, what specifically does subsample do? I know it randomly selects a subset, but does it ignore the rest?
Yes, it should ignore the rest. For example, setting subsample=0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process.
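To make the behavior concrete, here is a rough illustration of "one fixed random subsample, rest ignored". This is not TPOT's internal code; the helper `fixed_subsample_indices` and its arguments are hypothetical and only mimic the behavior described above.

```python
import random


def fixed_subsample_indices(n_samples, fraction, seed):
    """Pick one random subset of row indices; the other rows are left unused."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(n_samples * fraction))
    rng = random.Random(seed)  # seeded, so the same subset comes back every time
    return sorted(rng.sample(range(n_samples), k))


# With fraction=0.5, half of the rows are kept and the other half never used.
kept = fixed_subsample_indices(n_samples=10, fraction=0.5, seed=42)
ignored = sorted(set(range(10)) - set(kept))
print("kept:", kept)
print("ignored:", ignored)
```

In this sketch the subset is drawn once and reused for every evaluation, which mirrors the statement that the subsample stays the same for the whole optimization run.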
Thanks, I will try that, but why would subsampling help? Surely more data is better?
Sometimes running on more data is too slow.
Actually, my accuracy only depends on minimizing mean absolute error. Both optimizing for r squared and mean absolute error give the same mean absolute error, so I don't know that optimizing a custom scoring function would give better results.
I did not understand the equations/definition of this custom "accuracy" scoring function. Also, it is strange to use "accuracy" for a regression problem. Could you please explain a little more?
Sure, in this case, accuracy is just a measure of how often the actual target falls within the predicted range. The predicted range runs from the predicted value minus the mean absolute error to the predicted value plus the mean absolute error.
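One thing worth checking with this definition: because the width of the range is the mean absolute error itself, the share of targets inside it can sit near one half even when r squared is very high. The sketch below uses synthetic data and assumes, purely for illustration, normally distributed prediction errors; the thread does not say what the real error distribution looks like. All variable names and numbers here are made up.

```python
import random

rng = random.Random(0)
n = 100_000
target_sd = 10.0  # assumed spread of the targets
noise_sd = 1.0    # assumed spread of the prediction errors

# Synthetic targets, and predictions that equal the target plus Gaussian noise.
y = [rng.gauss(0.0, target_sd) for _ in range(n)]
p = [yi + rng.gauss(0.0, noise_sd) for yi in y]

# Mean absolute error of the predictions.
mae = sum(abs(yi - pi) for yi, pi in zip(y, p)) / n

# r squared: 1 - residual sum of squares / total sum of squares.
mean_y = sum(y) / n
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1.0 - ss_res / ss_tot

# "Accuracy" as defined in the thread: target within prediction +/- MAE.
in_range = [pi - mae <= yi <= pi + mae for yi, pi in zip(y, p)]
accuracy = sum(in_range) / len(in_range)

print(f"r2={r2:.3f}  mae={mae:.3f}  accuracy={accuracy:.3f}")
```

Under these assumptions r squared comes out around 0.99 while the accuracy stays around 57%. Making `noise_sd` smaller shrinks the MAE and the range together, so the accuracy should barely move. A different error distribution would give a different percentage.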