Hdbscan: Is there a way to save the model for future prediction?

Created on 3 Feb 2018  ·  4 comments  ·  Source: scikit-learn-contrib/hdbscan

Hi,

I wonder if there is a way to save the final model for future prediction? I understand that we can save a joblib object for tuning purpose that might be able to speed up the calculation but is there a way that we can just import the model back into python and use it to predict new data points without refitting the model. I am not sure if the "generate_prediction_data()" function is for this purpose and I cannot find a clear explanation of this function anywhere in the documentation.

Thanks,

All 4 comments

There is an approximate_predict function that can take a given model and make predictions for new data points. The generate_prediction_data method needs to be run on the model before approximate_predict can work. You should be able to pickle a model and restore it later for predictions.
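A minimal sketch of that workflow (the random arrays here are just placeholder data for illustration) might look like:

import numpy as np
import hdbscan

# Fit a clusterer, then build the extra structures approximate_predict needs
data = np.random.rand(100, 2)
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(data)
clusterer.generate_prediction_data()

# Assign new points to the existing clusters without re-fitting
new_points = np.random.rand(10, 2)
labels, strengths = hdbscan.approximate_predict(clusterer, new_points)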

Now the caveat is that approximate_predict is just that -- it is an approximation based on the clusters already assigned. It will not necessarily give the same answer you would get if you added the new data points and re-clustered from scratch. Hopefully it fills your needs however.

I totally understand that predicting the cluster for new data is based on the assumption that the existing clusters remain the same, so the approximate_predict function is exactly what I need.

So, to be clear, are the following steps correct?

fit the model (prediction_data=True) >> generate_prediction_data >> pickle the model >> make predictions later

I guess my question is which object should I pickle? Since I have already set the option prediction_data=True, I believe that I don't need to run generate_prediction_data() afterward, is that correct? If I do need to run the function, can you give me a code example? Is it something like clusterer.generate_prediction_data()?

Thanks,

PS. I am very thankful for your contributions and for always answering all the questions so quickly. Many people on my data science team are now aware of this model, and they all love it.

The following code would work:

model = hdbscan.HDBSCAN(prediction_data=True).fit(data)
labels, membership_strengths = hdbscan.approximate_predict(model, new_data)

If you wanted to save the model to disk and then later load it back up with another script, you would pickle the model to disk, then load the pickled model in the other script.
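For instance, a minimal sketch of that save/load split (the file name and the random placeholder arrays are just illustrative) could be:

import pickle
import numpy as np
import hdbscan

# Placeholder data for illustration
data = np.random.rand(100, 2)
new_data = np.random.rand(5, 2)

# Script 1: fit with prediction data enabled and save the model to disk
model = hdbscan.HDBSCAN(prediction_data=True).fit(data)
with open("hdbscan_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Script 2 (later): load the pickled model and predict on the new points
with open("hdbscan_model.pkl", "rb") as f:
    model = pickle.load(f)
labels, membership_strengths = hdbscan.approximate_predict(model, new_data)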

Thank you

