I got this error "AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'" when trying to save a Japanese NER model in spaCy 2.0.2. Can you help me fix this error? Thanks so much!
Thanks for the report! The reason this happens is that the Japanese tokenizer is a custom implementation via the Janome library and doesn't use spaCy's serializable Tokenizer class. So when you call nlp.to_disk(), spaCy will call the to_disk() methods of all pipeline components and the tokenizer – which fails in this case.
Possible solutions for now:
- Pickle the model and write a custom __init__.py that loads your pickle file instead of initialising the model the standard way (see here for details).
- Set nlp.tokenizer = None before saving out the model (see the sketch below). This is not so nice – but in this case, it shouldn't really matter, since spaCy currently doesn't ship with any Japanese language data that you'd want to serialize with the model anyway. (Haven't tested this approach, but pretty sure it works!)
- We should probably also allow disabling the tokenizer via the disable keyword argument on Language.to_disk (which is currently only possible for pipeline components). Will think about the best way to solve this.
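An untested sketch of the second option, combined with the pickle idea from the first – the output path here is just a placeholder:

import pickle

nlp.tokenizer = None  # drop the non-serializable JapaneseTokenizer

# placeholder path – a custom __init__.py would load this pickle
# instead of initialising the model the standard way
with open('/path/to/ja_ner_model.pkl', 'wb') as f:
    pickle.dump(nlp, f)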
Btw, curious to hear about your results on training Japanese NER – sounds very exciting!
@ines Thanks so much for your quick reply. I'll try your solution and give you my feedback on training Japanese NER :)
Hi @ines,
Sorry to bother you. I tried your solution but got this error when using Pickle: "AttributeError: Can't pickle local object 'FeatureExtracter…'"
import pickle

# save model to output directory
print("Saving model...")
nlp.tokenizer = None
ner_model = pickle.dumps(nlp)
Moreover, I tried to build simple Chinese and Thai NER models, and I could save those models successfully using the nlp.to_disk() method. I wonder if there is something wrong with Japanese support in spaCy v2.0.2 that causes this error: "AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'"
Can you help me out with this? Thanks so much.
Hmm, this is strange! I think the difference is that Japanese provides a create_tokenizer method (see here), while Thai and Chinese only overwrite make_doc (see here).
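Roughly, the two patterns look like this – a simplified sketch, not the actual spaCy source:

from spacy.language import Language
from spacy.tokens import Doc

# Japanese pattern: the language defaults provide create_tokenizer(),
# so nlp.tokenizer becomes a custom object (wrapping Janome) that
# nlp.to_disk() also tries to serialize
class JapaneseTokenizer(object):  # simplified stand-in
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()  # the real version segments via Janome
        return Doc(self.vocab, words=words, spaces=[False] * len(words))

class JapaneseDefaults(Language.Defaults):
    @classmethod
    def create_tokenizer(cls, nlp=None):
        return JapaneseTokenizer(cls.create_vocab(nlp))

# Chinese/Thai pattern: only make_doc() is overwritten; the default,
# serializable tokenizer stays in place, so nlp.to_disk() works
class Chinese(Language):
    lang = 'zh'

    def make_doc(self, text):
        import jieba
        words = [w for w in jieba.cut(text, cut_all=True) if w]
        return Doc(self.vocab, words=words, spaces=[False] * len(words))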
What happens if you don't use Pickle but the regular nlp.to_disk() method instead, and tell it to disable the tokenizer? For example:
nlp.to_disk('/path/to/model', disable=['tokenizer'])
If this works, the only problem here is that you'll also need to set disable=['tokenizer'] when you load the model back in using nlp.from_disk(). So packaging a model and loading it via spacy.load() won't work out-of-the-box.
We'll think about a good way to solve this in the future. When saving out a model, spaCy should probably check if the tokenizer is serializable and if not, show a warning, but serialize anyway.
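A minimal sketch of what such a guard inside Language.to_disk could look like (hypothetical, not actual spaCy code):

from pathlib import Path
import warnings

model_path = Path('/path/to/model')  # placeholder output directory

# hypothetical check before serializing the tokenizer
if not hasattr(nlp.tokenizer, 'to_disk'):
    warnings.warn("Tokenizer is not serializable – skipping it; "
                  "re-create it when loading the model back in")
else:
    nlp.tokenizer.to_disk(model_path / 'tokenizer')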
Nice to hear that Chinese and Thai worked well – this is really cool!
@ines
Oh yeah, it worked. I could save the Japanese NER model successfully. Thanks so much :)
Just pushed a fix to Japanese that implements "dummy" serialization methods on the tokenizer to prevent the error. I also found another small bug that caused the Japanese vocab not to set the lang correctly (meaning the saved-out model's meta.json had "lang": "" set, which causes an error when loading the model back in).
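The fix boils down to giving the tokenizer no-op serialization hooks, roughly like this (a simplified sketch of the pattern, not the exact committed code):

class JapaneseTokenizer(object):
    # ... tokenization via Janome ...

    # no-op serialization methods: there's no state to save, but
    # defining them prevents the AttributeError when spaCy calls
    # them during nlp.to_disk() / nlp.to_bytes()
    def to_bytes(self, **exclude):
        return b''

    def from_bytes(self, bytes_data, **exclude):
        return self

    def to_disk(self, path, **exclude):
        return None

    def from_disk(self, path, **exclude):
        return self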
Just tested it locally, and both to/from disk and to/from bytes now work correctly. This means you should also be able to package your Japanese model as a Python package using the spacy package command.
I have a similar problem that I could not fix: I've trained a custom NER model that I'd like to save to disk, and since I'm using a custom tokenizer, I don't want to save the tokenizer. Here's what I did:
import spacy
nlp = spacy.load("en")
nlp.tokenizer = some_custom_tokenizer
# Train the NER model...
nlp.tokenizer = None
nlp.to_disk('/tmp/my_model', disable=['tokenizer'])
(Due to this thread, I did not package the model.)
When I try to load it, the pipeline is empty, and surprisingly, it has the default spaCy tokenizer.
nlp = spacy.blank('en').from_disk('/tmp/my_model', disable=['tokenizer'])
I need to load the model without the tokenizer but with the full pipeline. Any ideas? thanks.
More about this issue: when I tried to load the model like this:
loaded_nlp = spacy.load('/model/directory', disable=['tokenizer'])
I got an error:
FileNotFoundError: [Errno 2] No such file or directory: '/model/directory/tokenizer'
I looked at the code of util.load_model_from_path and I think I found a bug there. Line 158 is:
return nlp.from_disk(model_path)
If the disable parameter were forwarded to this call, it would be possible to use spacy.load directly for loading models without specific parts:
return nlp.from_disk(model_path, disable=disable)
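With that change, loading a model without the tokenizer would work directly (hypothetical – this assumes the patch above is applied):

import spacy

# disable would now be forwarded to nlp.from_disk(), so the
# missing tokenizer directory is simply skipped
loaded_nlp = spacy.load('/model/directory', disable=['tokenizer'])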