Hi,
Is there a way to save the final model for future prediction? I understand that we can save a joblib object for tuning purposes, which might speed up the calculation, but is there a way to import the model back into Python and use it to predict new data points without refitting the model? I am not sure whether the "generate_prediction_data()" function is meant for this purpose, and I cannot find a clear explanation of this function anywhere in the documentation.
Thanks,
There is an approximate_predict function that can take a given model and make predictions for new data points. The generate_prediction_data method needs to be run on the model before approximate_predict can work. You should be able to pickle a model and restore it later for predictions.
The caveat is that approximate_predict is just that: an approximation based on the clusters already assigned. It will not necessarily give the same answer you would get if you added the new data points and re-clustered from scratch. Hopefully it meets your needs.
I understand that predicting the cluster for new data is based on the assumption that the clusters remain the same, so the approximate_predict function is exactly what I need.
So, to be clear, are the following steps correct?
fit the model (prediction_data = True) >> generate_prediction_data >> pickle the model >> make prediction later
I guess my question is: which object should I pickle? Since I have already set the option prediction_data = True, I believe that I don't need to run generate_prediction_data() afterward. Is that correct? If I do need to run the function, can you give me a code example? Is it something like clusterer.generate_prediction_data()?
Thanks,
PS. I am very thankful for your contribution and for always answering questions so quickly. Many people on my data science team are now aware of this model, and they all love it.
The following code would work:
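import hdbscan
# prediction_data=True builds the prediction data during fit, so no separate generate_prediction_data() call is needed
model = hdbscan.HDBSCAN(prediction_data=True).fit(data)
labels, membership_strengths = hdbscan.approximate_predict(model, new_data)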
If you want to save the model to disk and load it later from another script, pickle the model to disk, then load the pickled model in the other script.
Thank you