Gensim: Is it possible to run LdaMallet through Colab, or must it be run from the command line?

Created on 28 Jun 2019 · 6 Comments · Source: RaRe-Technologies/gensim

I have downloaded and unzipped Mallet in Google Drive, and I need to use it through Google Colab:

from google.colab import drive
drive.mount('/content/gdrive')
import os
os.environ['MALLET_HOME'] = '/content/gdrive/My Drive/mallet-2.0.8'
mallet_path = 'content/gdrive/My Drive/mallet-2.0.8/bin/mallet'
model = LdaMallet(mallet_path, corpus=common_corpus, num_topics=20, id2word=common_dictionary)

But I got this error:

CalledProcessError Traceback (most recent call last)
in ()
1 mallet_path = 'content\gdrive\My Drive\mallet-2.0.8\bin\mallet.bat'
----> 2 ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

3 frames
/usr/local/lib/python3.6/dist-packages/gensim/utils.py in check_output(stdout, *popenargs, **kwargs)
1877 error = subprocess.CalledProcessError(retcode, cmd)
1878 error.output = output
-> 1879 raise error
1880 return output
1881 except KeyboardInterrupt:

CalledProcessError: Command 'content\gdrive\My Drive\mallet-2.0.8bin\mallet.bat import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /tmp/f39c83_corpus.txt --output /tmp/f39c83_corpus.mallet' returned non-zero exit status 127.

All 6 comments

If you run the command shown in your error message yourself, you'll see the exact error.

My guess is you're trying something really strange, because the command shows Mallet running on Windows ("mallet.bat"), whereas your other traceback paths look like Linux.
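
One way to follow this advice from inside a notebook is to run the reported command through the shell and look at what the shell itself prints. The sketch below is only an illustration, not part of Gensim: the helper name `run_and_report` is made up here, and the command string is the launcher path from the original post followed by the `import-file` subcommand shown in the error. It relies on the standard `subprocess` module and on the POSIX convention that a shell returns exit status 127 when it cannot find the command.

```python
import subprocess


def run_and_report(cmd):
    """Run cmd through the shell and return (exit status, stderr text).

    Hypothetical helper for illustration only; not a Gensim function.
    """
    result = subprocess.run(
        cmd,
        shell=True,  # the same mode the traceback shows Gensim using
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
    )
    return result.returncode, result.stderr


# Launcher path as written in the original post, plus the subcommand
# that appears in the error message.
status, stderr = run_and_report(
    "content/gdrive/My Drive/mallet-2.0.8/bin/mallet import-file"
)
print(status)
print(stderr)
```

The text captured from stderr is the "exact error" referred to above; Gensim's `CalledProcessError` only reports the exit status.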

@piskvorky Thanks for your response. Having read through other posts, I tried both mallet.bat and mallet, and the same error occurs. Sorry about the mismatch in the original post; here is my exact code:

import os
os.environ['MALLET_HOME'] = '/content/gdrive/My Drive/mallet-2.0.8'
mallet_path = 'content/gdrive/My Drive/mallet-2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

and the error I get:

/usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function

'See the migration notes for details: %s' % _MIGRATION_NOTES_URL

CalledProcessError Traceback (most recent call last)
in ()
1 mallet_path = 'content/gdrive/My Drive/mallet-2.0.8/bin/mallet'
----> 2 ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

3 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/wrappers/ldamallet.py in __init__(self, mallet_path, corpus, num_topics, alpha, id2word, workers, prefix, optimize_interval, iterations, topic_threshold)
124 self.iterations = iterations
125 if corpus is not None:
--> 126 self.train(corpus)
127
128 def finferencer(self):

/usr/local/lib/python3.6/dist-packages/gensim/models/wrappers/ldamallet.py in train(self, corpus)
265
266 """
--> 267 self.convert_input(corpus, infer=False)
268 cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '\
269 '--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '\

/usr/local/lib/python3.6/dist-packages/gensim/models/wrappers/ldamallet.py in convert_input(self, corpus, infer, serialize_corpus)
254 cmd = cmd % (self.fcorpustxt(), self.fcorpusmallet())
255 logger.info("converting temporary corpus to MALLET format with %s", cmd)
--> 256 check_output(args=cmd, shell=True)
257
258 def train(self, corpus):

/usr/local/lib/python3.6/dist-packages/gensim/utils.py in check_output(stdout, *popenargs, **kwargs)
1877 error = subprocess.CalledProcessError(retcode, cmd)
1878 error.output = output
-> 1879 raise error
1880 return output
1881 except KeyboardInterrupt:

CalledProcessError: Command 'content/gdrive/My Drive/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /tmp/d5bf2_corpus.txt --output /tmp/d5bf2_corpus.mallet' returned non-zero exit status 127.

So, is there any issue with running mallet through Google Colab?

Yes, it looks like there is. If the reported CLI command doesn't run for you, it's an issue with your installation of Mallet (nothing to do with Gensim). Try raising your concern with the authors of Mallet.
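
For what it's worth, exit status 127 is what a POSIX shell returns when it cannot find the command, and two things in the posted path could cause that. It has no leading slash, so it is resolved relative to the current working directory. It also contains a space ("My Drive"), which matters because the traceback shows Gensim concatenating `mallet_path` into one command string and running it with `shell=True`. The sketch below checks for these conditions. `diagnose_mallet_path` is a hypothetical helper written for this illustration, not part of Gensim or Mallet.

```python
import os


def diagnose_mallet_path(mallet_path):
    """Return a list of likely problems with a Mallet launcher path.

    Hypothetical helper for illustration; not a Gensim or Mallet API.
    """
    problems = []
    if not os.path.isabs(mallet_path):
        problems.append("relative path: resolved against the current directory")
    if any(ch.isspace() for ch in mallet_path):
        problems.append("whitespace in path: an unquoted shell command splits here")
    if not os.path.isfile(mallet_path):
        problems.append("no file at this path")
    elif not os.access(mallet_path, os.X_OK):
        problems.append("file exists but is not executable")
    return problems


# Path exactly as given in the post above.
problems = diagnose_mallet_path("content/gdrive/My Drive/mallet-2.0.8/bin/mallet")
for problem in problems:
    print(problem)
```

An empty list does not prove that Mallet will run (it also needs a working Java runtime), but a non-empty one points at something to fix before looking further.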

It is possible to run Mallet via Gensim using Google Colab - here is a repo with a working notebook with the required setup:
https://github.com/polsci/colab-gensim-mallet
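
Independently of the linked notebook, whose exact steps are not reproduced here, one workaround that follows from the error in this thread is to copy the Mallet folder from Drive to an absolute location without spaces, then point both `MALLET_HOME` and `mallet_path` there. This is a sketch under assumptions: `stage_mallet` is a made-up helper, the destination is whatever space-free directory you choose, and the sketch does not check for the Java runtime that Mallet needs.

```python
import os
import shutil
import stat


def stage_mallet(src_dir, dst_dir):
    """Copy a Mallet folder to dst_dir and return the launcher path.

    Hypothetical helper for illustration; src_dir and dst_dir are
    whatever locations apply to your own setup.
    """
    if any(ch.isspace() for ch in dst_dir):
        raise ValueError("destination must not contain whitespace")
    if not os.path.isdir(dst_dir):
        shutil.copytree(src_dir, dst_dir)
    launcher = os.path.join(dst_dir, "bin", "mallet")
    # Make sure the launcher script carries the executable bit.
    mode = os.stat(launcher).st_mode
    os.chmod(launcher, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    # Mallet's launcher reads MALLET_HOME, as set in the original post.
    os.environ["MALLET_HOME"] = dst_dir
    return launcher
```

With the Drive folder from the original post, a call such as `stage_mallet('/content/gdrive/My Drive/mallet-2.0.8', '/content/mallet-2.0.8')` would return a launcher path that can be passed to `LdaMallet` as `mallet_path`; the destination `/content/mallet-2.0.8` is just an example.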

Thank you so much, Mr. Polsci. Most helpful.

@seanshir have you dealt with this issue?
