Hdbscan: Is there a way to save the model for future prediction?

Created on 3 Feb 2018  ·  4 comments  ·  Source: scikit-learn-contrib/hdbscan

Hi,

I wonder if there is a way to save the final model for future prediction? I understand that we can save a joblib object for tuning purpose that might be able to speed up the calculation but is there a way that we can just import the model back into python and use it to predict new data points without refitting the model. I am not sure if the "generate_prediction_data()" function is for this purpose and I cannot find a clear explanation of this function anywhere in the documentation.

Thanks,

All 4 comments

There is an approximate_predict function that can take a given model and make predictions for new data points. The generate_prediction_data method needs to be run on the model before approximate_predict can work. You should be able to pickle a model and restore it later for predictions.
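A minimal sketch of that workflow (the random arrays here are just placeholder data for illustration) might look like:

import numpy as np
import hdbscan

# Fit a clusterer, then build the extra structures approximate_predict needs
data = np.random.rand(100, 2)
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(data)
clusterer.generate_prediction_data()

# Assign new points to the existing clusters without re-fitting
new_points = np.random.rand(10, 2)
labels, strengths = hdbscan.approximate_predict(clusterer, new_points)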

Now the caveat is that approximate_predict is just that -- it is an approximation based on the clusters already assigned. It will not necessarily give the same answer you would get if you added the new data points and re-clustered from scratch. Hopefully it fills your needs however.

I totally understand that predicting the cluster for new data is based on the assumption that the existing clusters remain the same, so the approximate_predict function is exactly what I need.

So, to be clear, are the following steps correct?

fit the model (prediction_data=True) >> generate_prediction_data >> pickle the model >> make predictions later

I guess my question is which object should I pickle? Since I have already set the option prediction_data=True, I believe that I don't need to run generate_prediction_data() afterward, is that correct? If I do need to run the function, can you give me a code example? Is it something like clusterer.generate_prediction_data()?

Thanks,

PS. I am very thankful for your contributions and for always answering all the questions so quickly. Many people on my data science team are now aware of this model, and they all love it.

The following code would work:

model = hdbscan.HDBSCAN(prediction_data=True).fit(data)
labels, membership_strengths = hdbscan.approximate_predict(model, new_data)

If you wanted to save the model to disk and then later load it back up with another script, you would pickle the model to disk, then load the pickled model in the other script.
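For instance, a minimal sketch of that save/load split (the file name and the random placeholder arrays are just illustrative) could be:

import pickle
import numpy as np
import hdbscan

# Placeholder data for illustration
data = np.random.rand(100, 2)
new_data = np.random.rand(5, 2)

# Script 1: fit with prediction data enabled and save the model to disk
model = hdbscan.HDBSCAN(prediction_data=True).fit(data)
with open("hdbscan_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Script 2 (later): load the pickled model and predict on the new points
with open("hdbscan_model.pkl", "rb") as f:
    model = pickle.load(f)
labels, membership_strengths = hdbscan.approximate_predict(model, new_data)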

Thank you

