Gensim: Support streaming models split into multiple files from S3 / GCS

Created on 23 Jan 2018 · 33 comments · Source: RaRe-Technologies/gensim

Streaming small d2v models from an S3 bucket works fine: simply pass the S3 address to model.load, e.g. model.load('s3:///). However, when the model gets bigger and is split into multiple files, all files except the main model file fail to load: the other files are opened by numpy directly rather than through smart_open.

The essential part of my code is

from gensim.models import Doc2Vec
from gensim.utils import simple_preprocess

def load_model(model_file):
    return Doc2Vec.load(model_file)

# infer the most similar documents for an input string
def infer_docs(input_string, model_file, inferred_docs=5):
    model = load_model(model_file)
    processed_str = simple_preprocess(input_string, min_len=2, max_len=35)
    inferred_vector = model.infer_vector(processed_str)
    return model.docvecs.most_similar([inferred_vector], topn=inferred_docs)

Trying to load a bigger model from S3 yields:

[INFO]  2018-01-21T20:44:59.613Z    f2689816-feeb-11e7-b397-b7ff2947dcec    testing keys in event dict
[INFO]  2018-01-21T20:44:59.614Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading model from s3://data-d2v/trained_models/model_law
[INFO]  2018-01-21T20:44:59.614Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading Doc2Vec object from s3://data-d2v/trained_models/model_law
[INFO]  2018-01-21T20:44:59.650Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Found credentials in environment variables.
[INFO]  2018-01-21T20:44:59.707Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Starting new HTTPS connection (1): s3.eu-west-1.amazonaws.com
[INFO]  2018-01-21T20:44:59.801Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Starting new HTTPS connection (2): s3.eu-west-1.amazonaws.com
[INFO]  2018-01-21T20:45:35.830Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading wv recursively from s3://data-d2v/trained_models/model_law.wv.* with mmap=None
[INFO]  2018-01-21T20:45:35.830Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading syn0 from s3://data-d2v/trained_models/model_law.wv.syn0.npy with mmap=None
[Errno 2] No such file or directory: 's3://data-d2v/trained_models/model_law.wv.syn0.npy': FileNotFoundError
Traceback (most recent call last):
  File "/var/task/handler.py", line 20, in infer_handler
    event['input_text'], event['model_file'], inferred_docs=10)
  File "/var/task/infer_doc.py", line 26, in infer_docs
    model = load_model(model_file)
  File "/var/task/infer_doc.py", line 21, in load_model
    return Doc2Vec.load(model_file)
  File "/var/task/gensim/models/word2vec.py", line 1569, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/var/task/gensim/utils.py", line 282, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/var/task/gensim/models/word2vec.py", line 1593, in _load_specials
    super(Word2Vec, self)._load_specials(*args, **kwargs)
  File "/var/task/gensim/utils.py", line 301, in _load_specials
    getattr(self, attrib)._load_specials(cfname, mmap, compress, subname)
  File "/var/task/gensim/utils.py", line 312, in _load_specials
    val = np.load(subname(fname, attrib), mmap_mode=mmap)
  File "/var/task/numpy/lib/npyio.py", line 372, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 's3://data-d2v/trained_models/model_law.wv.syn0.npy'

My use case is to serve the models on an AWS Lambda. My current workaround is to download all model files to a local folder and then load the model from there, which is rather slow.

Labels: bug, difficulty medium, impact MEDIUM, reach MEDIUM

All 33 comments

Sounds useful, thanks @JensMadsen, we know about this.
I added the "feature" label; this will be really good for gensim.

P.S. workaround: save the model as one file, this should work fine.

@menshikh-iv IIRC, over a certain array-size (2GB or maybe 4GB), single-file pickling will break, even in 64-bit Python.

The numpy load() does take a file-like object, so it might be possible to smart_open() from S3 then pass numpy load() the stream, but their docs also suggest the file-like object must support seek(), which might be an issue.

@gojomo smart_open already supports the seek operation for S3
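To illustrate the point above: np.load() already accepts a seekable binary file-like object, so a seek-capable smart_open S3 stream could in principle be passed straight in. A minimal sketch, using an in-memory buffer as a stand-in for the remote stream:

```python
import io

import numpy as np

# np.load() only needs an object supporting read() and seek(), which is
# exactly what smart_open's S3 file objects provide. Here io.BytesIO
# simulates such a stream.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
buf = io.BytesIO()
np.save(buf, arr)
buf.seek(0)  # rewind, as a freshly opened S3 stream would start at offset 0
loaded = np.load(buf)
print(loaded.shape)  # (2, 3)
```

Note that mmap_mode cannot work over such a stream; only a plain load is possible.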

@menshikh-iv I tried that but with no success. What I did was pass separately=[], but I still got multiple files. My model is less than 1 GB.

@JensMadsen can you show code example + names of created files?

Ping @JensMadsen can you share the code? separately=[] shouldn't be creating two files IMO.

Sorry @menshikh-iv and @piskvorky about the delayed response

model = gensim.models.Doc2Vec.load(filename)
model.save('./tmpdata/reduced_' + model_name, separately=[])

generates four files:

reduced_<model_name>.trainables.syn1neg.npy
reduced_<model_name>.wv.vectors.npy
reduced_<model_name>.docvecs.vectors_docs.npy
reduced_<model_name>

If I use sep_limit=1000000 * 1024**2 I get a single file, but its size is somewhat bigger than the summed size of the separate files.

That a single file is a little bigger, with pickle overhead, than the separate files isn't alone something to be concerned about. (Though as I noted previously, I believe single-file pickling breaks at some size around 2-4GB, even on 64bit Pythons.)

That even separately=[] results in multiple files may be an issue with the refactorings into subsidiary objects not adopting any separately settings from the container - that might be an inadequacy in the refactoring work, or an inherent ambiguity in how it should be handled with recursive SaveLoad.save() operations.

@gojomo thanks. As described in the original post, I serve the model on an AWS Lambda. I cannot stream the model from an S3 bucket when the model files are split, due to this bug, so I try to save the model as a single file. However, AWS Lambdas only have 512 MB of disk space, which is not sufficient for me. I have therefore tried to use delete_temporary_training_data and then save the model, but that leads to even bigger files. Is there another way to achieve smaller model files? I do not need to continue training, but I do need infer_vector. Best, Jens

delete_temporary_training_data() is kind of a confused method, which with its defaults barely saves anything... but it should never make a model larger. (So, if you're seeing that, it may have been caused by something else.)

If you're never going to look up the vectors for those doc-tags supplied during training (as would happen for any model.docvecs.most_similar() operation), and are just using re-inferred vectors somewhere else, then you might be able to delete the model.docvecs.vectors_docs property without ill effect. If there were a lot of unique doctags in the training set, that might make a noticeable dent in model size.

If you're using plain DBOW mode during training (dm=0, dbow_words=0), then the word-vectors inside the model aren't really used, in training or later - so you might be able to delete the model.wv.vectors property without ill effect. (Or maybe even delete model.wv entirely, though it might still be consulted for maintaining the output layer, especially in negative-sampling models). But there could be problems with these approaches - test carefully in your setup – as the code hasn't consistently been designed/tested with such post-training minimization in mind.
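To make the potential savings concrete, here is a sketch using a stand-in object rather than a real Doc2Vec model (the attribute names mirror gensim's layout, but nothing below depends on gensim; array shapes are arbitrary):

```python
import pickle
from types import SimpleNamespace

import numpy as np

# Stand-in for a trained model: large word-vector and doc-vector arrays.
model = SimpleNamespace(
    wv=SimpleNamespace(vectors=np.zeros((10000, 100), dtype=np.float32)),
    docvecs=SimpleNamespace(vectors_docs=np.zeros((50000, 100), dtype=np.float32)),
)

before = len(pickle.dumps(model, protocol=4))

# In plain DBOW (dm=0, dbow_words=0) the word vectors are unused, and if
# you never look up training doc-tags, vectors_docs can go too.
model.wv.vectors = None
model.docvecs.vectors_docs = None

after = len(pickle.dumps(model, protocol=4))
print(after < before)  # True
```

As gojomo cautions above, do this kind of trimming only after testing that inference still works in your setup.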

@gojomo regarding "delete_temporary_training_data() is kind of a confused method, which with its defaults barely saves anything": can you outline what's wrong with it and what it SHOULD do instead?

@gojomo For me to reproduce that delete_temporary_training_data() does not lead to reduced file sizes I simply

import gensim 
model = gensim.models.Doc2Vec(<filename>)
model.delete_temporary_training_data()
model.save(<new filename>)

I just reproduced it with a freshly trained model (gensim 3.5.0) with dm = 0, dbow_words = 1, vector_size = 400, window = 8

@JensMadsen I think you mean load(<filename>), as there's no load-from-file constructor. But, what were the file sizes before and after? And, what if you try re-saving to a new filename before delete_temporary_training_data(), and then a third filename after? (I suspect you may see the same expansion in the re-save, because it's something else that's causing it, perhaps a patching-up of an older/partial model upon load. And then the post-'delete' save would save a tiny amount of space, as would be expected from its defaults of not-deleting-hardly-anything.)

@piskvorky In my opinion the method shouldn't exist at all. This need isn't common-enough, or sufficiently well-supported, to justify a tempting public method. If some hatchety-with-lots-of-caveats tricks for shrinking models are important for some users, those could be documented-with-disclaimers elsewhere – maybe an 'advanced tricks' notebook, or other findable help resource.

OTOH, if such minimization is important enough to be a tested/supported feature of the models, then a larger, competent, refactoring would be justified, where the code/objects are cleanly split into the various parts needed for different steps/end-uses.

OK. @menshikh-iv let's deprecate delete_temporary_training_data, along with any other half-implemented and not-so-well-thought-out methods.

I don't know if we can reverse that unfortunate refactoring by @manneshiva at this point, but that may be another option. A little drastic, but maybe safer. Either way, a cleanup of the inheritance / functionality abstractions across *2vec will be necessary; it's leaking too much.

@gojomo yes, sorry. I was typing from memory; I missed a "load". Sizes before and after are very similar, but after is a bit larger, e.g. 32.1 MB vs ~32.4 MB for a small test model.

@JensMadsen And what of a load() then save() without even doing a delete_temporary_training_data() in-between?

@piskvorky

I don't know if we can reverse that unfortunate refactoring by Shiva at this point,

Too late, there's no good way to do this (if we want backward compatibility). Much work has been done already, and I think the better way is to just fix the current refactoring (based on user feedback).

I would like to fix this issue.
@JensMadsen did you find a solution that works?
@menshikh-iv can you give some pointers to start with this?

Wondering if there are any plans to add support for this? I'm facing the same issue trying to use GCS for model storage, with the added twist that LdaModel apparently enforces storing attributes in separate files.

@hilaryp yeah that looks weird. That code seems to have been added in this commit: https://github.com/RaRe-Technologies/gensim/commit/e08af7b9d91207da3db56e3e97e65f83dafb1498, something to do with loading Python2 models in Python3:
https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q9-how-do-i-load-a-model-in-python-3-that-was-trained-and-saved-using-python-2

I'd consider it a bug though, hardwiring separately is not the way to do it.

Will you be able to take this up & open a PR with a fix?

@piskvorky Sure, I can try to take a look soon. Thanks for pointing me to that wiki. Anything else I should be aware of? And should I open a separate issue for this?

No need for a new ticket at this point – we can always create one if it turns out to be a separate (pun intended) problem.

From a quick glance, it seems that the number of places where np.load() is used in the code is limited, and they could be adapted to use a file-like object that reads from S3 instead. So this might not be very hard to support.

Reading directly would be nice – although mmap is impossible, and that's the main reason we store numpy arrays separately and use np.load().

Historically, the separate storage was also necessary to work around pickle size limits at the 2 GB/4 GB thresholds. That may be gone if we move to the newer pickle whenever possible, so that might be a suggested workaround for those wanting active-models-from-S3.

What do you mean by "newer pickle"?

Gensim serialization utilities seem to prefer pickle protocol v2 (PEP307, 2003, introduced in Python 2.3):

https://github.com/RaRe-Technologies/gensim/blob/e210f73c42c5df5a511ca27166cbc7d10970eab2/gensim/utils.py#L512

It will cause errors if large arrays (but not very large by modern standards) aren't stored using the separately option. (Some existing saves will even compress these 'separate' files, which ruins the potential for memory-mapping but still works around the pickle v2 problems.)

If I'm reading https://docs.python.org/3/library/pickle.html#data-stream-format correctly, pickle v4 (PEP3154, 2011, available since Python 3.4 and becoming Python's default in Python 3.8) will support larger objects without error. It might make sense to make this the default for all new saves in gensim-4.0.0 and later - and thus a possible workaround for anyone needing a single-file save of a ginormous model would be to just tell them to disable separately, and everything else will work.

(Pickle v5 adds "out of band" support which might be a better way to handle the 'separately' functionality, but is only bundled in Python 3.8 and later. But, it does have a fully-functional backport to 3.5, 3.6 and 3.7 if the benefits were large enough.)

Oh yes, that makes good sense. I'm sure we used pickle v2 to support forward compatibility (a model stored in Python 2.7 or Python 3.5 can still be loaded in Python 2.4, etc.).

With the Python changes in Gensim 4.0 (py3.6+) we don't have to worry about that as much. For storing we must support py3.6+; for loading we should still support older Pythons (at least py2.7 supported by Gensim 3.8.3).

So I'm +1 on changing the default pickle version to the highest protocol that py3.6 supports. And keeping it there for the 4.x series.

We could also store the current Gensim version within each pickle, to make our life easier determining whether the loaded model needs "upgrading" (rather than rely on a forest of hasattrs).

I don't know anything about "out-of-band" pickle v5, will look into it, thanks.

So I'm +1 on changing the default pickle version to the highest protocol that py3.6 supports. And keeping it there for the 4.x series.

We could add a GENSIM_DEFAULT_PICKLE_PROTOCOL constant in gensim.utils, code-commented to explain the reason for the current choice, and have SaveLoad (and any other stray uses) generally consult that as a default. Protocol 4 seems safest, as it's already bundled with Python 3.4+, but 5 would also be an option given the availability of the pickle5 backport package.
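A minimal sketch of what that constant-plus-consultation could look like; only the name GENSIM_DEFAULT_PICKLE_PROTOCOL comes from the proposal above, and the save_obj helper is hypothetical:

```python
import pickle

# Hypothetical sketch of the proposed module-level default in gensim.utils.
# Protocol 4 ships with Python 3.4+ and removes protocol 2's ~4 GB
# object-size limit, which is what breaks single-file saves of very large
# models.
GENSIM_DEFAULT_PICKLE_PROTOCOL = 4

def save_obj(obj, fname, protocol=None):
    # Callers consult the shared default instead of hardcoding protocol 2.
    if protocol is None:
        protocol = GENSIM_DEFAULT_PICKLE_PROTOCOL
    with open(fname, 'wb') as fout:
        pickle.dump(obj, fout, protocol=protocol)
```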

We could also store the current Gensim version within each pickle, to make our life easier determining whether the loaded model needs "upgrading" (rather than rely on a forest of hasattrs).

Tackling #2863 (a general lifecycle log inside models) would store that version on each save as one of its events. For our purposes, trusting precise versions may be OK, and will always be useful in understanding any anomalies/reports, though in many other contexts (like when web pages are customizing themselves for browsers) it's considered wiser to prefer attribute/behavior-probing over exact package/version testing, for flexibility around things like unforeseen backports/extensions/polyfills/etc.

Wondering if there are any plans to add support for this? I'm facing the same issue trying to use GCS for model storage, with the added twist that LdaModel apparently enforces storing attributes in separate files.

A scrappy workaround for this would be to downgrade your gensim package to version 0.13.3 (the release before e08af7b was added), load the model back, and use the separately=[] parameter to save it into a single file. You can then read it back using the latest version; this seemed to work fine for me.

@hilaryp More robustly, you could probably just pickle the model with pickle_protocol 4 or greater, then load it via unpickling. Unless you're either working around older-pickle size limits, or need the arrays stored separately (perhaps for memmapping, which is irrelevant when loading from S3), a giant single-file pickle may work just fine, making the built-in save()/load() superfluous.
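A sketch of that suggestion (function names are placeholders): pickle the whole model with protocol 4+ and read it back from a single stream. With smart_open installed, passing smart_open.open as the opener lets the same code target an s3:// or gs:// URI; the builtin open is used here so the sketch stays self-contained.

```python
import pickle

# Bypass gensim's save()/load() entirely: one pickle file, protocol 4+,
# readable from any seekable stream (e.g. one opened via smart_open).
def save_model_single_file(model, path, opener=open):
    with opener(path, 'wb') as fout:
        pickle.dump(model, fout, protocol=4)

def load_model_single_file(path, opener=open):
    with opener(path, 'rb') as fin:
        return pickle.load(fin)
```

For example, `load_model_single_file('s3://bucket/model.pkl', opener=smart_open.open)` would stream the model directly, with the caveats about size limits and memmapping noted above.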

Thanks for the suggestion @gojomo! Our workaround was to just copy everything into /tmp but I like this solution better. I was going to look into undoing the hardwiring of separately, but after reading this thread, I don't think I'm familiar enough with the backwards-compatibility issues to take it on.

