spaCy: Pretraining with option --n-save-every still saves all models

Created on 9 Apr 2020 · 11 Comments · Source: explosion/spaCy

I am running the following command:

!spacy pretrain $FILE_RF_SENTENCES en_core_sci_lg $DIR_MODELS_RF_SENT \
    --use-vectors --n-save-every 5

In order to use less space, I specified the option --n-save-every to save a model every X batches.

However, all models are still saved, with additional .temp.bin files:

$ ls -al /kaggle/working/models/tok2vec_rf_sent_sci
[snip]
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model110.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model110.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model111.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model111.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model112.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model112.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model113.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model113.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model114.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model114.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model115.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model115.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model116.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model116.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model117.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model117.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model118.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model118.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model119.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model119.temp.bin
[snip]

Info about spaCy

spaCy version: 2.2.4
Platform: Linux-4.19.112+-x86_64-with-debian-9.9
Python version: 3.6.6

Notebook

https://www.kaggle.com/chopeen/spacy-with-gpu-support/

enhancement feat / cli feat / serialize feat / tok2vec help wanted

chopeen

👍1

All 11 comments

Thanks for the report! I can replicate this - will look into it.

svlandeg on 10 Apr 2020

Hm, it looks like this is actually the intended behaviour: https://github.com/explosion/spaCy/pull/3510

Quoting the PR: "When using spacy pretrain, the model is saved only after every epoch. But each epoch can be very big since pretrain is used for language modeling tasks. So I added a --save-every option in the CLI to save after every --save-every batches."

Note the difference between epoch and batch! What the --n-save-every option does is create ADDITIONAL intermediate temp models after every X batches WITHIN an epoch. The regular model is still saved at the end of every epoch.

The relevant code is this:

    for epoch in range(epoch_start, n_iter + epoch_start):
        for batch_id, batch in ... :
            ...
            if n_save_every and (batch_id % n_save_every == 0):
                _save_model(epoch, is_temp=True)
        _save_model(epoch)

The naming of the option seems confusing though... I think we should add an additional option to support your use-case.
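
To make the epoch/batch distinction concrete, here is a small standalone simulation of the loop above. It is only a sketch: `simulate_saves` and its parameters are hypothetical names, no spaCy code is called, and the file naming is an assumption that mirrors the directory listing in the original report.

```python
def simulate_saves(n_epochs, batches_per_epoch, n_save_every=None):
    """Return the sorted file names the loop above would leave on disk.

    Hypothetical stand-in for the real training loop: it only records names.
    """
    written = set()
    for epoch in range(n_epochs):
        for batch_id in range(batches_per_epoch):
            # Intermediate checkpoint WITHIN the epoch. The name depends only
            # on the epoch (assumption), so repeated saves reuse one file.
            if n_save_every and batch_id % n_save_every == 0:
                written.add(f"model{epoch}.temp.bin")
        # Regular checkpoint at the end of EVERY epoch, whatever the option says.
        written.add(f"model{epoch}.bin")
    return sorted(written)


if __name__ == "__main__":
    print(simulate_saves(n_epochs=2, batches_per_epoch=10, n_save_every=5))
```

With the option set, every epoch ends up with both a `.bin` and a `.temp.bin` file, which matches the listing in the report. Without it, only the per-epoch `.bin` files appear.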

svlandeg on 10 Apr 2020

👍1

All clear now, thank you!

I noticed a typo in documentation for preview, so I submitted a PR (#5293).

chopeen on 11 Apr 2020

👍1

Happy to hear the confusion has been cleared up!

I still think it might be an interesting addition to have an option that does what you were originally looking for, i.e. store a model only every X iterations. If you (or anyone else) feel like contributing a PR, that would be most welcome!
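
Until such an option exists, a similar effect could be approximated after the fact by pruning the output directory. The following is a rough sketch, not part of spaCy: `prune_models` and `keep_every` are hypothetical names, and the `modelN.bin` / `modelN.temp.bin` naming scheme is assumed from the listing in the report.

```python
import re
from pathlib import Path

# Assumed naming scheme, taken from the directory listing in the report.
MODEL_NAME = re.compile(r"^model(\d+)(\.temp)?\.bin$")


def prune_models(model_dir, keep_every, dry_run=True):
    """Remove temp files and all epoch models except every `keep_every`-th.

    Returns the sorted names that were (or, with dry_run, would be) removed.
    """
    removed = []
    for path in Path(model_dir).iterdir():
        match = MODEL_NAME.match(path.name)
        if match is None:
            continue  # leave unrelated files (config, logs) alone
        epoch = int(match.group(1))
        is_temp = match.group(2) is not None
        if is_temp or epoch % keep_every != 0:
            removed.append(path.name)
            if not dry_run:
                path.unlink()
    return sorted(removed)
```

The default `dry_run=True` only reports what would be deleted, so the list can be checked first. Note that this sketch does not treat the final epoch specially, so the last model is removed too unless its number is a multiple of `keep_every`.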

svlandeg on 11 Apr 2020

Please assign the task to me. I will give it a try during the weekend.

I'm wondering if an option like --keep-only-when-better wouldn't make more sense: it would keep a model every time the loss reaches a new low.
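
A minimal sketch of what such a policy could look like, independent of spaCy. The class name `BestLossTracker` is hypothetical, the loss values are made up for illustration, and --keep-only-when-better is only the option name proposed in this thread, not an existing flag.

```python
import math


class BestLossTracker:
    """Decide whether a checkpoint is worth saving (hypothetical helper)."""

    def __init__(self):
        self.best_loss = math.inf

    def should_save(self, loss):
        # Save only when the loss is strictly lower than anything seen so far.
        if loss < self.best_loss:
            self.best_loss = loss
            return True
        return False


if __name__ == "__main__":
    tracker = BestLossTracker()
    losses = [9.0, 7.5, 7.8, 6.9, 7.0, 6.9]  # made-up per-epoch losses
    saved = [i for i, loss in enumerate(losses) if tracker.should_save(loss)]
    print(saved)  # [0, 1, 3]
```

A training loop would call `should_save(loss)` once per epoch and write the model only when it returns True.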

chopeen on 14 Apr 2020

Hey @chopeen, great if you want to give it a go! We don't really officially assign tasks to anyone, but nobody else is working on it right now, so you can definitely give it a shot!

svlandeg on 14 Apr 2020

👍1

I agree that that would be a useful option: it would save disk space. You may still get a lot of models at the beginning of training though, because the loss usually keeps dropping consistently for at least the first few dozen iterations. But you could give it a try and see how it works out.

svlandeg on 14 Apr 2020

👍1

@chopeen: I don't know whether you've had a chance to look into this yet, but Issue #3584 and the comment here are relevant: it's probably indeed a good idea to only save the best models.

svlandeg on 11 May 2020

@svlandeg Keeping N best models is definitely a better idea than randomly saving every n-th model.
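
For illustration, keeping the N best checkpoints can be sketched with a small heap. This is an assumption-laden sketch rather than anything spaCy provides: `TopNCheckpoints`, `offer` and `kept` are hypothetical names, and the caller is expected to delete whatever file name `offer` returns.

```python
import heapq


class TopNCheckpoints:
    """Track the N lowest-loss checkpoints seen so far (hypothetical helper)."""

    def __init__(self, n):
        self.n = n
        # Stores (-loss, name), so the worst kept entry sits at index 0.
        self._heap = []

    def offer(self, loss, name):
        """Register a checkpoint; return the file name to delete, or None."""
        entry = (-loss, name)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
            return None
        # Push the new entry, then pop the one with the highest loss.
        # If the new checkpoint is the worst, its own name is returned.
        evicted = heapq.heappushpop(self._heap, entry)
        return evicted[1]

    def kept(self):
        """Names of the kept checkpoints, lowest loss first."""
        return [name for _, name in sorted(self._heap, reverse=True)]
```

This bounds disk usage at N models regardless of how long the loss keeps improving, which a plain "save on every new low" rule does not.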

I reviewed the code a few weeks ago to see where to implement the change, but then I got swamped at work. Until the lock-down is over, this idea will need to sit on the back burner.

chopeen on 12 May 2020

That's OK, it would be a nice-to-have feature but I don't think it's urgent ;-)

svlandeg on 12 May 2020

This will be fixed from spaCy v3 onwards, which will save only one best model and one final model.

svlandeg on 20 Aug 2020

🎉2
