Gensim: Structural Topic Models in gensim

Created on 7 Dec 2016 · 27 Comments · Source: RaRe-Technologies/gensim

Hi,

I already talked with Ólavur about this and would like to suggest adding Structural Topic Models to gensim. STMs are, among other things, a generalization of author-topic models in which topic proportions are affected by covariates such as time, author, or other document attributes. The model is becoming increasingly dominant in computational social science, but I can also see it being useful for several industry applications. In the end, we almost always have additional information beyond the raw text, so why not make use of it?

Right now, STM is only available via R, so we would either need a wrapper or preferably an independent implementation of it which scales reasonably well.

An introduction for STM is available here and the repository can be found here.

Labels: difficulty hard, feature

All 27 comments

@methodds Do you know how the STM relates to supervised Latent Dirichlet Allocation (sLDA)? It sounds to me like they achieve a similar goal, but in quite different ways. sLDA is already on the wishlist for Gensim.

To be honest I hadn't heard of sLDA before, so I can't give you any details on that.

I looked a bit into STMs to try to understand it. From what I understand, the use case of STMs is where you have metadata attached to your documents and you want to leverage your topic model by including this metadata. The model then generates topic distributions from a logistic-normal generalized linear model based on the metadata (covariates of the document).

Is this correct?
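To make that reading concrete, here is a minimal numpy sketch of the generative step as described above (all names are hypothetical; the actual model also fixes one coordinate of eta for identifiability, which this skips):

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, P = 5, 3, 2                 # documents, topics, covariates
X = rng.normal(size=(D, P))       # document-level covariates (the metadata)
Gamma = rng.normal(size=(P, K))   # covariate -> topic-prevalence coefficients

# Logistic-normal link: a Gaussian draw centered at X @ Gamma,
# pushed through a softmax to land on the topic simplex.
eta = X @ Gamma + rng.normal(scale=0.5, size=(D, K))
theta = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)

# Each row of theta is now a valid topic distribution for one document,
# shifted by that document's metadata.
assert np.allclose(theta.sum(axis=1), 1.0)
```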

sLDA has a different goal from this. It aims to estimate some response variable based on the topic assignments of the words in the document. So, in a way, it is regression using LDA as features.

Actually, a combination of these two ideas seems quite reasonable.
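As a toy illustration of that "regression using LDA as features" view, one can regress a response on per-document topic proportions after the fact (a two-stage stand-in, not sLDA's actual joint inference):

```python
import numpy as np

rng = np.random.default_rng(1)

D, K = 200, 4
# Stand-in for inferred document-topic proportions from a fitted topic model.
theta = rng.dirichlet(np.ones(K), size=D)

# Simulated ground truth: the response depends linearly on topic proportions.
beta = np.array([2.0, -1.0, 0.5, 3.0])
y = theta @ beta + rng.normal(scale=0.01, size=D)

# Least-squares fit: topics as features, response as outcome.
beta_hat, *_ = np.linalg.lstsq(theta, y, rcond=None)
```

With low noise the coefficients are recovered closely; sLDA itself fits the topics and the response jointly, which changes the topics too.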

Hi,
yes I think your interpretation of STM is correct. Combining STM and sLDA also seems like an interesting idea, although I think this is already kind of covered in STM's estimateEffect() function.

You can find more about this in the vignette, page 14: "Estimating metadata/topic relationships". A result of estimateEffect() is shown in figure 7, where you can actually see estimates across time (document dates are included as a covariate in the corresponding model).

That's pretty cool. You just have to plug your trained STM and a regression formula into estimateEffect. It is also much more general than sLDA because you can use any regression formula you like.

Yes, it is super awesome, by far my favorite R package. From my personal experience, though, this gets problematic if you estimate effects for a lot of topics and have a large vocabulary plus a large number of documents, because you end up with a huge dense matrix of coefficients.

Do you know which parameter in the model is causing computational problems, in particular?

By the way, I'm having trouble finding useful literature on STMs. None of the papers about the model that I can find describe the inference algorithm in detail. There is the vignette for the R package and the NIPS 2013 paper, which lacks details, and then some other social science papers that I don't find particularly useful.

No sorry I don't know that, but you should contact Brandon Stewart, the person responsible for the R code of this package. He is a very nice guy and I'm sure he would be able to send you a document which better suits your needs for reconstructing how the inference algorithm works ;-)

Nice, thanks @methodds :)

CC @bstewart

Hi! I'm the Brandon Stewart mentioned above. I wanted to answer some questions that have come up here. First though, I'm honored that you guys would consider adding STM to gensim. It is an amazing package. Thanks so much @methodds

@olavurmortensen Your interpretation of STM and sLDA is correct. Also if you are still interested in a more technical paper, you might prefer this one in JASA: http://scholar.princeton.edu/sites/default/files/bstewart/files/a_model_of_text_for_experimentation_in_the_social_sciences.pdf I'm also happy to provide you some of the derivations and/or talk about how inference is set up. @methodds is right here that the algorithm as designed really hits a wall with super large data. By construction you need to maintain a dense D by K matrix of doubles (where D is the number of documents and K is the number of topics). We could discuss alternate estimation strategies that wouldn't have this problem though.

@methodds mentioned the idea that combining stm and sLDA might be taken care of with estimateEffect. Not quite though. In estimateEffect the topics are still the outcome (i.e. it answers the question how does some observed covariate in our metadata drive topical prevalence). sLDA is guiding our topics to produce topics that are predictive of both the words and the metadata. So stm is what is sometimes called an upstream metadata model (because the metadata causes the topics) and sLDA is a downstream metadata model (because the topics cause the metadata).

@olavurmortensen There are two bottlenecks worth noting. Part of the success of the model is our initialization strategy that uses the spectral method of the Arora group: http://www.jmlr.org/proceedings/papers/v28/arora13.pdf. This unfortunately involves forming a dense V by V matrix where V is the vocabulary size. This makes it impractical above say 10,000 unique words just for memory reasons. You can still run stm without this initialization though. In that scenario the two bottlenecks are the document-topic proportions (the dense D by K matrix I mentioned earlier) and the topic-word probabilities (a dense K by V matrix). In my experience it is the number of documents that grows quickly. This could be addressed in a few ways: Gibbs sampling would make the posterior representation sparse which would be helpful. We also needn't keep the D by K matrix in memory. I think it sort of depends on your preferences for handling memory issues in gensim.
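For back-of-the-envelope planning, the three dense arrays mentioned above translate into memory roughly as follows (assuming 8-byte doubles; the sizes D, K, V below are illustrative):

```python
def dense_mb(rows, cols, bytes_per=8):
    """Memory for a dense float64 matrix, in megabytes."""
    return rows * cols * bytes_per / 1e6

D, K, V = 1_000_000, 100, 10_000

print(dense_mb(D, K))  # doc-topic proportions (D x K):     800.0 MB
print(dense_mb(K, V))  # topic-word probabilities (K x V):    8.0 MB
print(dense_mb(V, V))  # spectral-init matrix (V x V):      800.0 MB
```

This shows why the number of documents D and the V-by-V initialization matrix are the bottlenecks, while the K-by-V topic-word matrix stays manageable.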

Thanks a lot @bstewart !

In gensim, we keep all algos streamed, so O(D) memory is prohibitive. O(V^2) is theoretically workable, but not ideal.

On the other hand, deriving an efficient streamed algorithm could be to @olavurmortensen 's liking, he's a mathematician :)
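For context, "streamed" in gensim means the corpus is an iterable that yields one bag-of-words document at a time, so memory stays constant in the number of documents. A self-contained sketch of that contract, with a tiny stand-in for gensim's `Dictionary.doc2bow`:

```python
from collections import Counter

def doc2bow(tokens, vocab):
    """Tiny stand-in for gensim's Dictionary.doc2bow:
    map tokens to sorted (token_id, count) pairs, skipping OOV words."""
    counts = Counter(t for t in tokens if t in vocab)
    return sorted((vocab[t], n) for t, n in counts.items())

def stream_corpus(lines, vocab):
    """Yield one bag-of-words document at a time: O(1) memory in the
    number of documents, which is the property gensim algorithms rely on."""
    for line in lines:
        yield doc2bow(line.lower().split(), vocab)

vocab = {"topic": 0, "model": 1, "data": 2}
docs = ["topic model", "data data topic"]
print(list(stream_corpus(docs, vocab)))
# [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
```

An STM implementation in gensim would need its per-document state (like the D-by-K proportions) to fit this pattern rather than live in one dense array.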

Fun! We've toyed a bit with stochastic variational inference and so O(D) memory is definitely avoidable. If its something you guys want to move forward with, we can definitely talk some more about the best ways to approach it. It is unclear to me how to do the spectral initializations in a true streaming setting, but if you are comfortable with two passes through the documents it is very doable. One could always imagine having two options: one with the second pass and one without (which would then need to forgo the smart initialization).

Thanks. I'm not familiar with the "spectral initializations"; is it something to do with eigenvalues? If so, we have some fast streamed algos in gensim (via one-pass SVD). If not, can you point me to the algo?
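For reference, the one-pass SVD alluded to here is in the randomized-sketch family (Halko et al. style): accumulate a small sketch of the matrix chunk by chunk, then decompose the sketch. A rough numpy illustration on a low-rank matrix, not gensim's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
# A rank-8 matrix standing in for a streamed term-document matrix.
A = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 50))

k = 10
Omega = rng.normal(size=(50, k + 5))  # random test matrix, slight oversampling

# Pass over row chunks: accumulate the sketch Y = A @ Omega without
# ever holding more than one chunk of A in memory.
Y = np.zeros((1000, k + 5))
for start in range(0, 1000, 100):
    Y[start:start + 100] = A[start:start + 100] @ Omega

Q, _ = np.linalg.qr(Y)  # orthonormal basis for the (approximate) range of A
B = Q.T @ A             # project A into the small subspace (a second pass)
s = np.linalg.svd(B, compute_uv=False)

# For an exactly low-rank matrix, the sketch captures the range exactly,
# so the leading singular values of B match those of A.
```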

I think Brandon is talking about the algorithm(s) described in the paper A Practical Algorithm for Topic Modeling with Provable Guarantees, written by the Arora group (see his post above).

Hey sorry guys. @methodds is correct about the source.

To @piskvorky it is a non-negative matrix factorization. It is basically a two-stage thing where you have a GS-decomposition of a matrix, then you do exponentiated gradient descent a bunch of times to find some weights.
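The exponentiated-gradient step mentioned here is a multiplicative update that keeps the weights on the probability simplex. Roughly, on a generic objective (this is not the actual anchor-words recovery code):

```python
import numpy as np

def exp_grad_step(w, grad, step=0.1):
    """One exponentiated-gradient update: multiplicative in the gradient,
    then renormalized, so w stays a valid probability vector."""
    w = w * np.exp(-step * grad)
    return w / w.sum()

w = np.full(4, 0.25)                       # start at uniform weights
grad = np.array([1.0, 0.0, -1.0, 0.0])     # pretend gradient of some loss
for _ in range(50):
    w = exp_grad_step(w, grad)

# Mass flows toward the coordinate with the most negative gradient,
# while the weights remain non-negative and sum to one.
```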

@piskvorky Sorry for the late response (to your comment about deriving a streamed algo). I won't have a lot of time to work on this, but am willing to assist in any way I can.

Hey @olavurmortensen @piskvorky @bstewart @cschwem2er, my team, people analytics at ING (Dutch bank) is really interested in getting STM working in Python, and we have gotten buy-in from our manager to use some of our work time/resources to develop the feature as a contribution to Gensim (or, alternatively, as a Python re-implementation of the R stm library). Has anyone made any progress since this thread went inactive? If so, is there a way we can pitch in to help out? If not, are there any notes or things we should know before launching into this?

Wow! No progress on my end, but this would be sooooo good!

This is great to hear. I have a postdoc @michaelzhang01 who is going to be doing some work on a scalable implementation in R, and @vineetbansal who has been helping with speeding up the existing codebase in various ways. None of this is streaming, though.

The only steps towards reimplementation in other languages that I know of are:

@jtexnl what would be helpful from me? I don't have a ton of bandwidth right now, but I'd love to see this succeed obviously.

@jtexnl hi, that sounds good, feel free to contribute!

Has anyone made any progress since this thread went inactive?

I don't think so.

I'll add my support for this as well, and my team (People Analytics at McKinsey & Co) is willing to devote some resources to make this a reality as well. I'm not sure how much time we can devote, but we'd love to help where we can.

Checking in to see if this project is being picked up - is Gensim still participating in GSOC? I can imagine it would be a cool student project. I second everyone here that this would be a much appreciated feature in Gensim, and that STMs are widely used in Computational Social Science research.

edit: pinging @mpenkov and @piskvorky to hear if GSOC is still running and if this project is a priority for Gensim :)

Is there an open pull request where this is being worked on? Could we begin with re-implementing the R code without addressing the O(V^2) or O(D) issues, or would that be too inefficient to be included in gensim?

@policyglot and I are working on a local re-implementation of the R code... we will open a PR soon!

@bhargavvader and @policyglot: I saw you're both at UChicago. I live just north of Hyde Park in Chicago, and work in the Loop. I'm currently quarantining in California, but maybe we can connect and work together on this when I'm back in Chicago.

Hey @djacques7188
Neat coincidence that you live so close by to @bhargavvader and me! Sure, would love to collaborate, even while you're in Cali.
I used R's stm library extensively in my Master's thesis repo, alongside an implementation of LDA in both scikit-learn and gensim. Hopping between R and Python was not fun. So I'm definitely in favor of integrating stm into the core gensim architecture.
Bhargav and I have finished some initial scoping from the stm documentation to discuss which functionalities to prioritize. Would you like me to add you as a collaborator to our fork?
