Spacy: Cannot save a model to disk if I add a tagger component

Created on 12 Jun 2020 · 3 comments · Source: explosion/spaCy

How to reproduce the behaviour

Hello!
I tried to run the following code:

import spacy

nlp = spacy.blank("fr")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(nlp.create_pipe("tagger"))
nlp.add_pipe(nlp.create_pipe("parser"))
nlp.to_disk("my/folder")

and I got the following error message:

  File "my\folder/create_blank_model.py", line 18, in <module>
    nlp.to_disk(PATH + '/../models/test_model')
  File "my\folder\v_env\lib\site-packages\spacy\language.py", line 911, in to_disk
    util.to_disk(path, serializers, exclude)
  File "my\folder\v_env\lib\site-packages\spacy\util.py", line 645, in to_disk
    writer(path / key)
  File "my\folder\v_env\lib\site-packages\spacy\language.py", line 909, in <lambda>
    serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
  File "pipes.pyx", line 632, in spacy.pipeline.pipes.Tagger.to_disk
  File "my\folder\v_env\lib\site-packages\spacy\util.py", line 645, in to_disk
    writer(path / key)
  File "pipes.pyx", line 628, in spacy.pipeline.pipes.Tagger.to_disk.lambda22
TypeError: Required argument 'length' (pos 1) not found

Process finished with exit code 1

I must admit that I do not understand what this 'length' argument that I should supposedly add is.
Moreover, if I run the code without the line adding the tagger, as in:

import spacy

nlp = spacy.blank("fr")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(nlp.create_pipe("parser"))
nlp.to_disk("my/folder")

then all works smoothly.
I came across a workaround, which is to load an existing model (fr_core_news_md), but for my current project I need to train the NER and text classifier of a blank model, and I also need the tagger.

Your Environment

  • spaCy version: 2.2.4
  • Platform: Windows-10-10.0.17134-SP0
  • Python version: 3.6.5
Labels: bug, feat / serialize, feat / tagger

All 3 comments

Thanks for the report, that does look like a bug!

I'm guessing this only happens when the model inside your tagger hasn't actually been instantiated yet. The model is instantiated when you call begin_training. You can check whether calling it fixes the issue for now. If that's the case, you wouldn't run into any problems once you actually do create a model, train it, and then store it.

But I think we should be able to make this code more robust, as well.

Thank you for the answer, the solution you provided works perfectly!

From v3 onwards, pipeline components will always need to have a model internally, so you won't be able to get into this weird state where you have a pipeline that isn't really initialized yet. In almost all use cases, you'll store a component only after training a model for it, so you wouldn't run into the original error you described. So I think it's not too big a deal for the current v2.x versions, and it looks like you can continue working too :-) If that's OK, I'll go ahead and close this.
