Machinelearning: [Meta Issue] LDA topic word summary and other enhancements

Created on 31 Jan 2020 · 2 Comments · Source: dotnet/machinelearning

This issue summarizes many user requests to expose or add functionality to the Latent Dirichlet Allocation (LDA) model, along with a few nits.

P1: LDA model generates inconsistent results after model save and load

#1004

P1: Topic Word Summary

Existing functionality that does not work as advertised.

The LatentDirichletAllocation transform provides a popular topic modeling algorithm. It learns a model to categorize a document into n topics, and outputs an n-dimensional topic vector for each document, with scores for the document belonging to each topic.

Topic1  Topic2  Topic3
0.6364  0.2727  0.0909
0.5455  0.1818  0.2727

There are two use cases for this:

1. Use the topic vector as a featurizer for a trainer.
2. Predict which topic a document belongs to directly from the LDA model, for example "Cat related", "Dog related", "Tech news", etc.

(1) works fine as is, but (2) has usability issues related to the interpretability of the learned topics. The topic vector does not name the topics, so it is not possible to tell what a topic actually represents.
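To make the limitation in use case (2) concrete, here is a minimal Python sketch. It is not ML.NET code: the rows are copied from the example topic vectors in this issue, and `most_likely_topic` is a hypothetical helper introduced only for illustration.

```python
# Topic vectors copied from the example output in this issue.
# Everything else here is illustrative and not part of ML.NET.
topic_vectors = [
    [0.6364, 0.2727, 0.0909],
    [0.5455, 0.1818, 0.2727],
]


def most_likely_topic(topic_vector):
    """Return the 0-based index of the highest-scoring topic."""
    return max(range(len(topic_vector)), key=lambda i: topic_vector[i])


# One predicted topic index per document.
predicted = [most_likely_topic(v) for v in topic_vectors]

# Each prediction is only a column index. Nothing in the vector says
# whether that topic is "Cat related" or "Tech news".
```

The prediction itself is easy to compute; what is missing is a human-readable description of each topic index.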

LDA has a parameter numberOfSummaryTermsPerTopic, which in theory is supposed to provide a list of the most important terms for each topic. These terms would identify what a topic learned by the model actually is. However, there is currently no way to get this summary from the model, because the model parameters are not exposed. The parameter therefore misleads the user into thinking that this is possible.
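As a rough sketch of what such a summary could look like if per-topic word weights were exposed, the Python below picks the highest-weighted terms for each topic. The vocabulary, the weights, and the `top_terms_per_topic` helper are all made-up assumptions for illustration; this is not how ML.NET or LightLDA stores or exposes its state.

```python
def top_terms_per_topic(topic_word_weights, vocabulary, k):
    """Return the k highest-weighted vocabulary terms for each topic.

    topic_word_weights: one list of weights per topic, aligned with vocabulary.
    """
    summaries = []
    for weights in topic_word_weights:
        # Rank vocabulary indices by descending weight for this topic.
        ranked = sorted(range(len(vocabulary)), key=lambda i: weights[i], reverse=True)
        summaries.append([vocabulary[i] for i in ranked[:k]])
    return summaries


# Hypothetical data: 3 topics over a 6-word vocabulary.
vocabulary = ["cat", "dog", "leash", "purr", "laptop", "release"]
topic_word_weights = [
    [9.0, 1.0, 0.5, 6.0, 0.1, 0.2],
    [1.0, 8.0, 7.0, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.1, 0.1, 9.0, 5.0],
]

# k plays the role that numberOfSummaryTermsPerTopic is described as having.
summaries = top_terms_per_topic(topic_word_weights, vocabulary, k=2)
```

With a summary like this, a user could label topic indices by hand, which is the piece use case (2) currently lacks.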

It used to be accessible in the old PigSty API (#1411), but that has since been removed. It was also present in TLC as a text file in the model.zip, which was not a particularly user-friendly approach anyway: https://github.com/dotnet/machinelearning/issues/4322#issuecomment-572333759. The summary should be made accessible to the user by extracting the parameters from LdaState: https://github.com/dotnet/machinelearning/issues/1411#issue-374603754.

Main issue containing the discussion: #4322
Duplicate issues with the same ask: #1411 #2197 #3092 #4328 #4735

P2: Export Full Model

New feature request: #3092

This ask is to expose more model parameters than just the topic word summary.

P2: Seeded LDA

New feature request: #4143

Seed each topic with a list of words so that the topic's words converge in that direction. This capability may not be present in LightLDA, which ML.NET wraps.
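One way seeding is sometimes approximated in LDA implementations is by biasing the per-topic word prior toward the seed words. The Python sketch below only builds such a prior matrix. It is an assumption about how the request could be realized, not something LightLDA or ML.NET is known to support, and `seeded_topic_word_prior`, `base_prior`, and `seed_boost` are hypothetical names.

```python
def seeded_topic_word_prior(vocabulary, seed_words_per_topic,
                            base_prior=0.01, seed_boost=1.0):
    """Build a topics x vocabulary prior that favors each topic's seed words.

    Every word gets base_prior; a topic's seed words get an extra seed_boost.
    """
    prior = []
    for seeds in seed_words_per_topic:
        seed_set = set(seeds)
        prior.append([
            base_prior + (seed_boost if word in seed_set else 0.0)
            for word in vocabulary
        ])
    return prior


# Hypothetical vocabulary and seed lists, one list per topic.
vocabulary = ["cat", "dog", "laptop"]
seed_words_per_topic = [["cat"], ["dog"], ["laptop"]]

prior = seeded_topic_word_prior(vocabulary, seed_words_per_topic)
```

A sampler that accepts an asymmetric prior like this would start each topic leaning toward its seed words; whether the wrapped LightLDA code can accept one is the open question raised above.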

P3: Nits

LDA always prints to console: #3192

Thank you to our users who have brought these to our attention:
@hobbsa @IvanAntipov @MagicMaxxx @nukeandserve @PaulDMendoza

cc: @harishsk @justinormont @yaeldekel @antoniovs1029 @gvashishtha

Labels: API, P2, enhancement, usability

najeeb-kazmi

All 2 comments

Assigning @gvashishtha for triage.

najeeb-kazmi on 31 Jan 2020

Once you complete this if you could update the documentation with an example that uses this it would be great. Thanks.

PaulDMendoza on 31 Jan 2020
