I am running the following command:
!spacy pretrain $FILE_RF_SENTENCES en_core_sci_lg $DIR_MODELS_RF_SENT \
--use-vectors --n-save-every 5
In order to use less space, I specified the option --n-save-every to save a model every X batches.
However, all models are still saved, with additional .temp.bin files:
$ ls -al /kaggle/working/models/tok2vec_rf_sent_sci
[snip]
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model110.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model110.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model111.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model111.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model112.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model112.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model113.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model113.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model114.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model114.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model115.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model115.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model116.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model116.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model117.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model117.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model118.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model118.temp.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model119.bin
-rw-r--r-- 1 root root 3889626 Apr 9 08:49 model119.temp.bin
[snip]
Thanks for the report! I can replicate this - will look into it.
Hm, it looks like this is actually the intended behaviour: https://github.com/explosion/spaCy/pull/3510
When using spacy pretrain, the model is saved only after every epoch. But each epoch can be very big since pretrain is used for language modeling tasks. So I added a --save-every option in the CLI to save after every --save-every batches.
Note the difference between epoch and batch! What the --n-save-every option does is write ADDITIONAL intermediate temp models after every X batches WITHIN an epoch.
The relevant code is this:
for epoch in range(epoch_start, n_iter + epoch_start):
    for batch_id, batch in ... :
        ...
        if n_save_every and (batch_id % n_save_every == 0):
            _save_model(epoch, is_temp=True)
    _save_model(epoch)
The naming of the option seems confusing though... I think we should add an additional option to support your use-case.
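To make the epoch/batch distinction concrete, here is a minimal simulation of the loop above. It is only a sketch: `simulate_pretrain_saves` is a hypothetical helper, not part of spaCy, and the file names are assumed from the directory listing in the original report (`modelN.bin` and `modelN.temp.bin`).

```python
def simulate_pretrain_saves(n_epochs, batches_per_epoch, n_save_every):
    """Return every file name written, in order.

    Hypothetical stand-in for the pretrain loop: it records names only
    and does not touch the file system. Naming is assumed from the
    listing in the report.
    """
    written = []
    for epoch in range(n_epochs):
        for batch_id in range(batches_per_epoch):
            # Intermediate save WITHIN the epoch; the same temp file
            # name is reused, so later writes overwrite earlier ones.
            if n_save_every and (batch_id % n_save_every == 0):
                written.append(f"model{epoch}.temp.bin")
        # Regular save at the end of every epoch, regardless of the option.
        written.append(f"model{epoch}.bin")
    return written


writes = simulate_pretrain_saves(n_epochs=3, batches_per_epoch=12, n_save_every=5)
files_on_disk = sorted(set(writes))
print(files_on_disk)
```

Under these assumptions, every epoch still leaves a `modelN.bin` behind, plus one `modelN.temp.bin`, which matches the listing in the report: the option adds files rather than removing any.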
All clear now, thank you!
I noticed a typo in documentation for preview, so I submitted a PR (#5293).
Happy to hear the confusion has been cleared up!
I still think it might be an interesting addition to have an option that does what you were originally looking for - i.e. store a model only every X iterations. If you (or anyone else) feel like contributing a PR, that would be most welcome!
Please assign the task to me. I will give it a try during the weekend.
I'm wondering if an option like --keep-only-when-better wouldn't make more sense, to keep a model every time the loss reaches a new low.
Hey @chopeen, great if you want to give it a go! We don't really officially assign tasks to anyone, but nobody else is working on it right now, so you can definitely give it a shot!
I agree that that would be a useful option: it would save disk space. You may still get a lot of models in the beginning of the training though, because usually the loss keeps dropping consistently in the first dozens of iterations at least. But you could give it a try and see how it works out.
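A rough sketch of what "keep only when better" would retain, to illustrate the point about early epochs. The function name and the loss values are made up for illustration; nothing here is existing spaCy code.

```python
def epochs_to_keep(losses):
    """Return the indices of epochs whose loss is a new minimum."""
    best = float("inf")
    kept = []
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            kept.append(epoch)
    return kept


# Made-up per-epoch losses: a steady drop at first, then noise.
losses = [9.0, 7.5, 6.1, 6.3, 5.8, 5.9, 6.0, 5.2]
kept = epochs_to_keep(losses)
print(kept)
```

With these invented numbers, the first three epochs are all kept because the loss drops every time, and only later do some epochs get skipped.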
@chopeen: I don't know whether you've had a chance to look into this yet, but Issue #3584 and the comment here are relevant: it's probably indeed a good idea to only save the best models.
@svlandeg Keeping N best models is definitely a better idea than randomly saving every n-th model.
I reviewed the code a few weeks ago to see where to implement the change, but then I got swamped at work. Until the lock-down is over, this idea will need to sit on a back burner.
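For reference, the "keep N best" idea mentioned above could be sketched roughly as follows. This is an assumption about how such an option might work, not an implementation from spaCy; `keep_n_best` and the loss values are hypothetical, and the sketch only tracks epoch numbers instead of deleting files.

```python
import heapq


def keep_n_best(losses, n):
    """Return the epochs whose models would remain after training
    when only the n lowest-loss models are kept."""
    heap = []  # max-heap on loss, stored as (-loss, epoch)
    for epoch, loss in enumerate(losses):
        if len(heap) < n:
            heapq.heappush(heap, (-loss, epoch))
        elif loss < -heap[0][0]:
            # The new model beats the worst retained one; a real
            # implementation would delete the evicted model file here.
            heapq.heapreplace(heap, (-loss, epoch))
    return sorted(epoch for _, epoch in heap)


# Made-up per-epoch losses, for illustration only.
losses = [9.0, 7.5, 6.1, 6.3, 5.8, 5.9, 6.0, 5.2]
remaining = keep_n_best(losses, n=3)
print(remaining)
```

Unlike "keep only when better", this bounds disk usage at n models no matter how long the loss keeps dropping.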
That's OK, it would be a nice-to-have feature but I don't think it's urgent ;-)
This will be fixed from spaCy v3 onwards, which will only save one best and one final model.