Gensim: Inferring doc topics from loaded LdaMallet fails due to missing corpus .mallet file

Created on 9 Aug 2016 · 15 Comments · Source: RaRe-Technologies/gensim

I trained an LdaMallet model and saved it. I'm able to load the model from disk without issue, but when I try to infer the topics of a new document, I see errors like:

2016-08-08 21:53:11,189 : INFO : serializing temporary corpus to /tmp/679c3f_corpus.txt
2016-08-08 21:53:11,190 : INFO : converting temporary corpus to MALLET format with
/<mallet_path>/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex '\S+' --input /tmp/679c3f_corpus.txt --output /tmp/679c3f_corpus.mallet.infer --use-pipe-from /tmp/679c3f_corpus.mallet
java.io.FileNotFoundException: /tmp/679c3f_corpus.mallet (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
[...]

After digging through the code, it seems that the inference is passing the --use-pipe-from flag to mallet for a .mallet object that doesn't exist. Digging a bit further, I think this .mallet file is supposed to be the data from the trained model that enables Mallet to convert the new document's tokens into the model's vocabulary. (If I remove the --use-pipe-from flag, the import works just fine.) I'm confused why the necessary second mallet file isn't getting written, which seems like a bug.

Labels: bug, difficulty medium

Most helpful comment

@menshikh-iv setting the prefix to an appropriate place in my model estimation call has solved this for me in any case.

All 15 comments

@davidjurgens Thank you for reporting this. Could you post your Python code of how you train/save/load/infer the model so I could reproduce it?

I assume the same result happens when you run the import-file command from the command line.

I put together a minimal working example, and the bug is stranger than I expected. I can reproducibly trigger it by doing the following:

  • Train the topic model
  • Delete the mallet data stored in /tmp
  • Load the model and run the inference

When I was seeing the issue originally, I had horrific problems trying to run LdaMallet in a multiprocessing setting, which produced corrupted .mallet files (a different bug, though). After I cleaned up the corrupted files in /tmp, LdaMallet never worked again.

I've attached two scripts to easily reproduce this issue. Run the training script (after specifying your Mallet install path), delete all the MALLET-related files in /tmp, and then run the testing script. It should produce the error. However, if you skip the deletion and run the training script immediately followed by the test script, it works just fine.

lda_debug_train.py.txt
lda_debug_test.py.txt

Yes, and that is the intended behaviour AFAIK.

Use the prefix parameter to specify a path for MALLET to store its files in, instead of the system temp directory.
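For example, a minimal sketch of pointing prefix at a persistent location (the directory name here is illustrative, and the training call is commented out because it needs a MALLET install):

```python
import os

# pick a directory that survives reboots, unlike /tmp
prefix_dir = os.path.join(os.path.expanduser('~'), 'mallet_files')
os.makedirs(prefix_dir, exist_ok=True)

# gensim appends file names such as 'corpus.mallet' to this prefix string,
# so include a trailing file-name stem, not just the directory
prefix = os.path.join(prefix_dir, 'lda_')

# hypothetical training call with the wrapper:
# lda = LdaMallet(mallet_path, corpus, num_topics=20,
#                 id2word=dictionary, prefix=prefix)
```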

Or what else should gensim do here? What result did you expect?

I guess I would expect gensim to store any data necessary for inference in the path specified by lda.save(output_lda_file). In the current implementation, a trained model can't be used on a different system, or after the computer restarts (and clears /tmp). The pure-Python LDA implementation doesn't have this issue (AFAIK) because save writes all the necessary data to files, so I would expect LdaMallet to behave the same way.

Also, I interpreted the description of prefix to be where the temporary files are written when gensim is converting the data to Mallet input.
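For reference, a sketch of the side files the wrapper writes under prefix; the suffixes come from my reading of the wrapper source, so verify them against your gensim version:

```python
# file-name suffixes gensim's LdaMallet appends to `prefix`
# (an assumption based on the wrapper source, not an official list)
suffixes = ['corpus.txt', 'corpus.mallet', 'state.mallet.gz',
            'doctopics.txt', 'inferencer.mallet',
            'topickeys.txt', 'wordweights.txt']

prefix = '/tmp/679c3f_'  # the prefix seen in the log above
side_files = [prefix + s for s in suffixes]
# '/tmp/679c3f_corpus.mallet' is exactly the file the traceback says is missing
```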

Thanks, I see. It makes sense to me.

I think we could cherry pick some of the mallet files and copy them over during save (assuming they are still there when save is called). Or maybe load them even earlier, during training, into some internal binary attribute, to be saved into a file later during save. Not sure about the API.

@tmylk @cscorley comments?

I think that makes sense. There shouldn't be any reason to lose essential files. I'd be in favor of avoiding keeping the file in memory, unless it is typically small (I'm having trouble remembering what that file does). Not much you can do to help a user who deletes a file before .save has a chance to copy it, aside from keeping the file open for reading.
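One way to sketch the "copy them over during save" idea discussed above (the helper name is mine, not part of the gensim API, and the side-file paths would be the wrapper's prefix + suffix files):

```python
import os
import shutil

def save_with_side_files(side_files, target_dir):
    """Copy whichever MALLET side files still exist next to the saved model.

    `side_files` would be paths like prefix + 'corpus.mallet'; a file the
    user already deleted simply can't be rescued, as noted above.
    """
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for path in side_files:
        if os.path.exists(path):
            shutil.copy(path, target_dir)
            copied.append(os.path.basename(path))
    # model.save(...) would then write the main model file into target_dir
    return copied
```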

Has there been any progress on this? I've just had the same issue come up.

As far as I know, nobody is working on it right now @cbjrobertson

@menshikh-iv setting the prefix to an appropriate place in my model estimation call has solved this for me in any case.

I had the same issue and what I did was to save what I wanted from the model and pickle it.

If you are running the model in parallel like me, make sure each model has a different prefix as it throws an exception at the end!
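A small sketch of what "a different prefix per model" can look like in a parallel setup (the helper name is mine; the commented training call assumes the wrapper's usual signature):

```python
import os
import tempfile

def unique_prefix(tag):
    # a fresh scratch directory per model/worker, so parallel runs never
    # overwrite each other's corpus.mallet / state files
    scratch = tempfile.mkdtemp(prefix='lda_%s_' % tag)
    return os.path.join(scratch, 'lda_')

# hypothetical usage with the wrapper:
# lda = LdaMallet(mallet_path, corpus, num_topics=10, id2word=dictionary,
#                 prefix=unique_prefix('worker1'))
```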

I make sure I explicitly set prefix to somewhere the sysadmin/system won't delete, delete the remains of my old models, and then uninstall and reinstall gensim. I would love to know a simpler way to reset where gensim expects to find these files, but once it has run, gensim's MALLET wrapper always expects them to be there... (even when the original model is gone!)

To summarize this a bit.

Problem: training an LdaMallet model writes several files into /tmp by default, and inference needs those too, not just the file you write with .save(). This is a problem whenever you lose that disk between runs, say when running on separate server instances, or via Docker without a persistent volume.

Solution: ideally gensim would bundle everything into your one saved file, but that never got done (and very likely won't). So instead, specify lda = LdaMallet(..., prefix="xxx"), where xxx is a path prefix in a directory you control, e.g. a network-mounted volume or a volume accessible to your Docker containers.

FWIW in my case, I'm not using volumes because of a complicated setup, so I'm saving the files to Postgres and fetching and loading them later. If anyone's interested:

import os, shutil, psycopg2
from gensim.models.wrappers import LdaMallet
from sqlalchemy import create_engine

# create table models(id bigint primary key, model bytea);
engine = create_engine(DB_URL)

os.environ['MALLET_HOME'] = '/mallet-2.0.8'
mallet_path = os.environ['MALLET_HOME'] + '/bin/mallet'
lda = LdaMallet(
    mallet_path,
    ...,
    prefix=os.path.join(os.getcwd(), 'data', 'lda_')
)

def save():
    lda.save("data/lda.model")
    # zip up data/, which now holds lda.model plus the lda_* side files
    shutil.make_archive('lda', 'zip', 'data')
    with open('lda.zip', "rb") as f:
        bytes_ = f.read()
    bytes_ = psycopg2.Binary(bytes_)
    with engine.connect() as conn:
        # poor man's upsert: delete any old row, then insert the new blob
        sql = """
        delete from models where id=%s;
        insert into models (id, model) values (%s, %s);
        """
        id_ = 123
        conn.execute(sql, (id_, id_, bytes_))

def load():
    with engine.connect() as conn:
        sql = "select model from models where id=%s"
        id_ = 123
        obj = conn.execute(sql, (id_,)).fetchone()
        with open("lda.zip", "wb") as f:
            f.write(obj.model.tobytes())  # bytea comes back as a memoryview
        # the lda_* side files are restored under data/, where prefix points
        shutil.unpack_archive('lda.zip', 'data', 'zip')
        return LdaMallet.load("data/lda.model")

I found the following worked for me.

# In the training file
# Create mallet model
lda_mallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus_term_matrix, num_topics=4, id2word=dictionary)
#Save mallet model
lda_mallet.save(filepath)

# In the testing file
# Load saved mallet
lda_mallet = gensim.models.wrappers.LdaMallet.load(filepath)
# Convert to LDA
lda_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet)

malletmodel2ldamodel has known issues (https://stackoverflow.com/questions/54684552/issue-with-topic-word-distributions-after-malletmodel2ldamodel-in-gensim)

What worked for me is:

# specify the prefix path when training the model
ldamodel = gensim.models.wrappers.LdaMallet(args.mallet_path, ...., prefix='path/to/prefix')

# pickle the model
with open('path/to/model/file', 'wb') as pickle_writer:
    pickle.dump(ldamodel, pickle_writer)

# load model from pickle
with open('path/to/model/file', 'rb') as pickle_reader:
    ldamodel = pickle.load(pickle_reader)

# the following is necessary if you are working on a different environment with a prefix path 
# that is different from when you trained the model
ldamodel.prefix = 'path/to/new/prefix'
