Gensim: Support streaming models split into multiple files from S3 / GCS

Created on 23 Jan 2018 · 33 comments · Source: RaRe-Technologies/gensim

Streaming small d2v models from an S3 bucket works fine: simply pass the S3 address to model.load, e.g. model.load('s3:///). However, when the model gets bigger and is split into multiple files, all files except the main model file fail to load: the other files are opened by numpy directly rather than through smart_open.

The essential part of my code is

from gensim.models import Doc2Vec
from gensim.utils import simple_preprocess

def load_model(model_file):
    return Doc2Vec.load(model_file)

# infer the most similar documents for an input string
def infer_docs(input_string, model_file, inferred_docs=5):
    model = load_model(model_file)
    processed_str = simple_preprocess(input_string, min_len=2, max_len=35)
    inferred_vector = model.infer_vector(processed_str)
    return model.docvecs.most_similar([inferred_vector], topn=inferred_docs)

Trying to load a bigger model from S3 yields:

[INFO]  2018-01-21T20:44:59.613Z    f2689816-feeb-11e7-b397-b7ff2947dcec    testing keys in event dict
[INFO]  2018-01-21T20:44:59.614Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading model from s3://data-d2v/trained_models/model_law
[INFO]  2018-01-21T20:44:59.614Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading Doc2Vec object from s3://data-d2v/trained_models/model_law
[INFO]  2018-01-21T20:44:59.650Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Found credentials in environment variables.
[INFO]  2018-01-21T20:44:59.707Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Starting new HTTPS connection (1): s3.eu-west-1.amazonaws.com
[INFO]  2018-01-21T20:44:59.801Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Starting new HTTPS connection (2): s3.eu-west-1.amazonaws.com
[INFO]  2018-01-21T20:45:35.830Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading wv recursively from s3://data-d2v/trained_models/model_law.wv.* with mmap=None
[INFO]  2018-01-21T20:45:35.830Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading syn0 from s3://data-d2v/trained_models/model_law.wv.syn0.npy with mmap=None
[Errno 2] No such file or directory: 's3://data-d2v/trained_models/model_law.wv.syn0.npy': FileNotFoundError
Traceback (most recent call last):
  File "/var/task/handler.py", line 20, in infer_handler
    event['input_text'], event['model_file'], inferred_docs=10)
  File "/var/task/infer_doc.py", line 26, in infer_docs
    model = load_model(model_file)
  File "/var/task/infer_doc.py", line 21, in load_model
    return Doc2Vec.load(model_file)
  File "/var/task/gensim/models/word2vec.py", line 1569, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/var/task/gensim/utils.py", line 282, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/var/task/gensim/models/word2vec.py", line 1593, in _load_specials
    super(Word2Vec, self)._load_specials(*args, **kwargs)
  File "/var/task/gensim/utils.py", line 301, in _load_specials
    getattr(self, attrib)._load_specials(cfname, mmap, compress, subname)
  File "/var/task/gensim/utils.py", line 312, in _load_specials
    val = np.load(subname(fname, attrib), mmap_mode=mmap)
  File "/var/task/numpy/lib/npyio.py", line 372, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 's3://data-d2v/trained_models/model_law.wv.syn0.npy'

My use case is to serve the models on an AWS Lambda. My current workaround is to download all model files to a local folder and then load the model from there, which is rather slow.

Labels: bug, difficulty medium, impact MEDIUM, reach MEDIUM

All 33 comments

Sounds useful, thanks @JensMadsen, we know about this.
I added the "feature" label; this will be really good for gensim.

P.S. workaround: save the model as one file, this should work fine.

@menshikh-iv IIRC, over a certain array-size (2GB or maybe 4GB), single-file pickling will break, even in 64-bit Python.

The numpy load() does take a file-like object, so it might be possible to smart_open() from S3 then pass numpy load() the stream, but their docs also suggest the file-like object must support seek(), which might be an issue.

@gojomo smart_open already supports the seek operation for S3
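To illustrate the point above: np.load() already accepts a seekable binary file-like object, so a seek-capable smart_open S3 stream could in principle be passed straight in. A minimal sketch, using an in-memory buffer as a stand-in for the remote stream:

```python
import io

import numpy as np

# np.load() only needs an object supporting read() and seek(), which is
# exactly what smart_open's S3 file objects provide. Here io.BytesIO
# simulates such a stream.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
buf = io.BytesIO()
np.save(buf, arr)
buf.seek(0)  # rewind, as a freshly opened S3 stream would start at offset 0
loaded = np.load(buf)
print(loaded.shape)  # (2, 3)
```

Note that mmap_mode cannot work over such a stream; only a plain load is possible.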

@menshikh-iv I tried that but with no success. What I did was pass separately=[], but I still got multiple files. My model is less than 1 GB.

@JensMadsen can you show code example + names of created files?

Ping @JensMadsen can you share the code? separately=[] shouldn't be creating two files IMO.

Sorry @menshikh-iv and @piskvorky about the delayed response

model = gensim.models.Doc2Vec.load(filename)
model.save('./tmpdata/reduced_' + model_name, separately=[])

generates four files:

reduced_<model_name>.trainables.syn1neg.npy
reduced_<model_name>.wv.vectors.npy
reduced_<model_name>.docvecs.vectors_docs.npy
reduced_<model_name>

If I use sep_limit=1000000 * 1024**2 I get a single file, but its size is somewhat bigger than the summed size of the separate files.

That a single file is a little bigger, with pickle overhead, than the separate files isn't alone something to be concerned about. (Though as I noted previously, I believe single-file pickling breaks at some size around 2-4GB, even on 64bit Pythons.)

That even separately=[] results in multiple files may be an issue with the refactorings into subsidiary objects not adopting any separately settings from the container - that might be an inadequacy in the refactoring work, or an inherent ambiguity in how it should be handled with recursive SaveLoad.save() operations.

@gojomo thanks. As described in the original post, I serve the model on an AWS Lambda. I cannot stream the model from an S3 bucket when the model files are split, due to this bug, so I try to save the model as a single file. However, AWS Lambdas only have 512 MB of disk space, which is not sufficient for me. I have therefore tried to use delete_temporary_training_data and then save the model, but that leads to even bigger files. Is there another way to achieve smaller model files? I do not need to continue training, but I do need infer_vector. Best, Jens

delete_temporary_training_data() is kind of a confused method, which with its defaults barely saves anything... but it should never make a model larger. (So, if you're seeing that, it may have been caused by something else.)

If you're never going to look up the vectors for those doc-tags supplied during training (as would happen for any model.docvecs.most_similar() operation), and are just using re-inferred vectors somewhere else, then you might be able to delete the model.docvecs.vectors_docs property without ill effect. If there were a lot of unique doctags in the training set, that might make a noticeable dent in model size.

If you're using plain DBOW mode during training (dm=0, dbow_words=0), then the word-vectors inside the model aren't really used, in training or later - so you might be able to delete the model.wv.vectors property without ill effect. (Or maybe even delete model.wv entirely, though it might still be consulted for maintaining the output layer, especially in negative-sampling models). But there could be problems with these approaches - test carefully in your setup – as the code hasn't consistently been designed/tested with such post-training minimization in mind.
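To make the potential savings concrete, here is a sketch using a stand-in object rather than a real Doc2Vec model (the attribute names mirror gensim's layout, but nothing below depends on gensim; array shapes are arbitrary):

```python
import pickle
from types import SimpleNamespace

import numpy as np

# Stand-in for a trained model: large word-vector and doc-vector arrays.
model = SimpleNamespace(
    wv=SimpleNamespace(vectors=np.zeros((10000, 100), dtype=np.float32)),
    docvecs=SimpleNamespace(vectors_docs=np.zeros((50000, 100), dtype=np.float32)),
)

before = len(pickle.dumps(model, protocol=4))

# In plain DBOW (dm=0, dbow_words=0) the word vectors are unused, and if
# you never look up training doc-tags, vectors_docs can go too.
model.wv.vectors = None
model.docvecs.vectors_docs = None

after = len(pickle.dumps(model, protocol=4))
print(after < before)  # True
```

As gojomo cautions above, do this kind of trimming only after testing that inference still works in your setup.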

@gojomo regarding "delete_temporary_training_data() is kind of a confused method, which with its defaults barely saves anything": can you outline what's wrong with it and what it SHOULD do instead?

@gojomo For me to reproduce that delete_temporary_training_data() does not lead to reduced file sizes I simply

import gensim 
model = gensim.models.Doc2Vec(<filename>)
model.delete_temporary_training_data()
model.save(<new filename>)

I just reproduced it with a freshly trained model (gensim 3.5.0) with dm = 0, dbow_words = 1, vector_size = 400, window = 8

@JensMadsen I think you mean load(<filename>), as there's no load-from-file constructor. But, what were the file sizes before and after? And, what if you try re-saving to a new filename before delete_temporary_training_data(), and then a third filename after? (I suspect you may see the same expansion in the re-save, because it's something else that's causing it, perhaps a patching-up of an older/partial model upon load. And then the post-'delete' save would save a tiny amount of space, as would be expected from its defaults of not-deleting-hardly-anything.)

@piskvorky In my opinion the method shouldn't exist at all. This need isn't common-enough, or sufficiently well-supported, to justify a tempting public method. If some hatchety-with-lots-of-caveats tricks for shrinking models are important for some users, those could be documented-with-disclaimers elsewhere – maybe an 'advanced tricks' notebook, or other findable help resource.

OTOH, if such minimization is important enough to be a tested/supported feature of the models, then a larger, competent, refactoring would be justified, where the code/objects are cleanly split into the various parts needed for different steps/end-uses.

OK. @menshikh-iv let's deprecate delete_temporary_training_data, along with any other half-implemented and not-so-well-thought-out methods.

I don't know if we can reverse that unfortunate refactoring by @manneshiva at this point, but that may be another option. A little drastic, but maybe safer. Either way, a cleanup of the inheritance / functionality abstractions across *2vec will be necessary; it's leaking too much.

@gojomo yes, sorry. I was typing from memory; I missed a "load". Sizes before and after are very similar, but after is a bit larger, e.g. 32.1 MB vs ~32.4 MB for a small test model.

@JensMadsen And what of a load() then save() without even doing a delete_temporary_training_data() in-between?

@piskvorky

I don't know if we can reverse that unfortunate refactoring by Shiva at this point,

Too late, there's no good way to do this (if we want backward compatibility). Much work has been done already, and I think the better way is to just fix the current refactoring (based on user feedback).

I would like to fix this issue.
@JensMadsen did you find a solution that works?
@menshikh-iv can you give some pointers to start with this?

Wondering if there are any plans to add support for this? I'm facing the same issue trying to use GCS for model storage, with the added twist that LdaModel apparently enforces storing attributes in separate files.

@hilaryp yeah that looks weird. That code seems to have been added in this commit: https://github.com/RaRe-Technologies/gensim/commit/e08af7b9d91207da3db56e3e97e65f83dafb1498, something to do with loading Python2 models in Python3:
https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q9-how-do-i-load-a-model-in-python-3-that-was-trained-and-saved-using-python-2

I'd consider it a bug though, hardwiring separately is not the way to do it.

Will you be able to take this up & open a PR with a fix?

@piskvorky Sure, I can try to take a look soon. Thanks for pointing me to that wiki. Anything else I should be aware of? And should I open a separate issue for this?

No need for a new ticket at this point – we can always create one if it turns out to be a separate (pun intended) problem.

From a quick glance, it seems that the number of places where np.load() is used in the code is limited, and they could be adapted to use a file-like object that reads from S3 instead. So this might not be very hard to support.

Reading directly would be nice – although mmap is impossible, and that's the main reason we store numpy arrays separately and use np.load().

Historically, the separate storage was also necessary to work around pickle size limits at the 2 GB/4 GB thresholds. That may be gone if we move to the newer pickle whenever possible, so that might be a suggested workaround for those wanting active-models-from-S3.

What do you mean by "newer pickle"?

Gensim serialization utilities seem to prefer pickle protocol v2 (PEP307, 2003, introduced in Python 2.3):

https://github.com/RaRe-Technologies/gensim/blob/e210f73c42c5df5a511ca27166cbc7d10970eab2/gensim/utils.py#L512

It will cause errors if large arrays (but not very large by modern standards) aren't stored using the separately option. (Some existing saves will even compress these 'separate' files, which ruins the potential for memory-mapping but still works around the pickle v2 problems.)

If I'm reading https://docs.python.org/3/library/pickle.html#data-stream-format correctly, pickle v4 (PEP3154, 2011, available since Python 3.4 and becoming Python's default in Python 3.8) will support larger objects without error. It might make sense to make this the default for all new saves in gensim-4.0.0 and later - and thus a possible workaround for anyone needing a single-file save of a ginormous model would be to just tell them to disable separately, and everything else will work.

(Pickle v5 adds "out of band" support which might be a better way to handle the 'separately' functionality, but is only bundled in Python 3.8 and later. But, it does have a fully-functional backport to 3.5, 3.6 and 3.7 if the benefits were large enough.)

Oh yes, that makes good sense. I'm sure we used pickle v2 to support forward compatibility (a model stored in Python 2.7 or Python 3.5 can still be loaded in Python 2.4, etc.).

With the Python changes in Gensim 4.0 (py3.6+) we don't have to worry about that as much. For storing we must support py3.6+; for loading we should still support older Pythons (at least py2.7 supported by Gensim 3.8.3).

So I'm +1 on changing the default pickle version to the highest protocol that py3.6 supports. And keeping it there for the 4.x series.

We could also store the current Gensim version within each pickle, to make our life easier determining whether the loaded model needs "upgrading" (rather than rely on a forest of hasattrs).

I don't know anything about "out-of-band" pickle v5, will look into it, thanks.

So I'm +1 on changing the default pickle version to the highest protocol that py3.6 supports. And keeping it there for the 4.x series.

We could add a GENSIM_DEFAULT_PICKLE_PROTOCOL constant in gensim.utils, code-commented to explain the reason for the current choice, and have SaveLoad (and any other stray uses) generally consult that as a default. Protocol 4 seems safest, as it's already bundled with Python 3.4+, but 5 would also be an option given the availability of the pickle5 backport package.
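A minimal sketch of what that constant-plus-consultation could look like; only the name GENSIM_DEFAULT_PICKLE_PROTOCOL comes from the proposal above, and the save_obj helper is hypothetical:

```python
import pickle

# Hypothetical sketch of the proposed module-level default in gensim.utils.
# Protocol 4 ships with Python 3.4+ and removes protocol 2's ~4 GB
# object-size limit, which is what breaks single-file saves of very large
# models.
GENSIM_DEFAULT_PICKLE_PROTOCOL = 4

def save_obj(obj, fname, protocol=None):
    # Callers consult the shared default instead of hardcoding protocol 2.
    if protocol is None:
        protocol = GENSIM_DEFAULT_PICKLE_PROTOCOL
    with open(fname, 'wb') as fout:
        pickle.dump(obj, fout, protocol=protocol)
```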

We could also store the current Gensim version within each pickle, to make our life easier determining whether the loaded model needs "upgrading" (rather than rely on a forest of hasattrs).

Tackling #2863 (a general lifecycle log inside models) would store that version on each save as one of its events. For our purposes, trusting precise versions may be OK, and will always be useful in understanding any anomalies/reports, though in many other contexts (like when web pages are customizing themselves for browsers) it's considered wiser to prefer attribute/behavior-probing over exact package/version testing, for flexibility around things like unforeseen backports/extensions/polyfills/etc.

Wondering if there are any plans to add support for this? I'm facing the same issue trying to use GCS for model storage, with the added twist that LdaModel apparently enforces storing attributes in separate files.

A scrappy workaround for this would be to downgrade your gensim package to version 0.13.3 (the release before e08af7b was added), load the model back, and use the separately=[] parameter to save it into a single file. You can then read it back using the latest version; this seemed to work fine for me.

@hilaryp More robustly, you could probably just pickle the model with pickle_protocol 4 or greater, then load it via unpickling. Unless you're either working around older-pickle size limits, or need the arrays stored separately (perhaps for memmapping, which is irrelevant when loading from S3), a giant single-file pickle may work just fine, making the built-in save()/load() superfluous.
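A sketch of that suggestion (function names are placeholders): pickle the whole model with protocol 4+ and read it back from a single stream. With smart_open installed, passing smart_open.open as the opener lets the same code target an s3:// or gs:// URI; the builtin open is used here so the sketch stays self-contained.

```python
import pickle

# Bypass gensim's save()/load() entirely: one pickle file, protocol 4+,
# readable from any seekable stream (e.g. one opened via smart_open).
def save_model_single_file(model, path, opener=open):
    with opener(path, 'wb') as fout:
        pickle.dump(model, fout, protocol=4)

def load_model_single_file(path, opener=open):
    with opener(path, 'rb') as fin:
        return pickle.load(fin)
```

For example, `load_model_single_file('s3://bucket/model.pkl', opener=smart_open.open)` would stream the model directly, with the caveats about size limits and memmapping noted above.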

Thanks for the suggestion @gojomo! Our workaround was to just copy everything into /tmp but I like this solution better. I was going to look into undoing the hardwiring of separately, but after reading this thread, I don't think I'm familiar enough with the backwards-compatibility issues to take it on.

