I got this error "AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'" when trying to save a Japanese NER model in spaCy 2.0.2. Can you help me fix this error? Thanks so much!
Thanks for the report! The reason this happens is that the Japanese tokenizer is a custom implementation via the Janome library and doesn't use spaCy's serializable Tokenizer class. So when you call nlp.to_disk(), spaCy will call the to_disk() methods of all pipeline components and the tokenizer – which fails in this case.
Possible solutions for now:
- Pickle the model and write a custom __init__.py that loads your pickle file instead of initialising the model the standard way (see here for details).
- Set nlp.tokenizer = None before saving out the model (see the sketch below). This is not so nice – but in this case, it shouldn't really matter, since spaCy currently doesn't ship with any Japanese language data that you'd want to serialize with the model anyway. (Haven't tested this approach, but pretty sure it works!)
- We should probably also allow disabling the tokenizer via the disable keyword argument on Language.to_disk (which is currently only possible for pipeline components). Will think about the best way to solve this.
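An untested sketch of the second option, combined with the pickle idea from the first – the output path here is just a placeholder:

import pickle

nlp.tokenizer = None  # drop the non-serializable JapaneseTokenizer

# placeholder path – a custom __init__.py would load this pickle
# instead of initialising the model the standard way
with open('/path/to/ja_ner_model.pkl', 'wb') as f:
    pickle.dump(nlp, f)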
Btw, curious to hear about your results on training Japanese NER – sounds very exciting!
@ines Thanks so much for your quick reply. I'll try your solution and give you my feedback on training Japanese NER :)
Hi @ines,
Sorry to bother you. I tried your solution but got this error when using Pickle: "AttributeError: Can't pickle local object 'FeatureExtracter…'"
import pickle

# save model to output directory
print("Saving model...")
nlp.tokenizer = None
ner_model = pickle.dumps(nlp)
Moreover, I tried to build simple Chinese and Thai NER models, and I could save those models successfully using the nlp.to_disk() method. I wonder if there is something wrong with Japanese support in spaCy v2.0.2 that causes this error: "AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'"
Can you help me out with this? Thanks so much.
Hmm, this is strange! I think the difference is that Japanese provides a create_tokenizer method (see here), while Thai and Chinese only overwrite make_doc (see here).
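Roughly, the two patterns look like this – a simplified sketch, not the actual spaCy source:

from spacy.language import Language
from spacy.tokens import Doc

# Japanese pattern: the language defaults provide create_tokenizer(),
# so nlp.tokenizer becomes a custom object (wrapping Janome) that
# nlp.to_disk() also tries to serialize
class JapaneseTokenizer(object):  # simplified stand-in
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()  # the real version segments via Janome
        return Doc(self.vocab, words=words, spaces=[False] * len(words))

class JapaneseDefaults(Language.Defaults):
    @classmethod
    def create_tokenizer(cls, nlp=None):
        return JapaneseTokenizer(cls.create_vocab(nlp))

# Chinese/Thai pattern: only make_doc() is overwritten; the default,
# serializable tokenizer stays in place, so nlp.to_disk() works
class Chinese(Language):
    lang = 'zh'

    def make_doc(self, text):
        import jieba
        words = [w for w in jieba.cut(text, cut_all=True) if w]
        return Doc(self.vocab, words=words, spaces=[False] * len(words))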
What happens if you don't use Pickle but the regular nlp.to_disk() method instead, and tell it to disable the tokenizer? For example:
nlp.to_disk('/path/to/model', disable=['tokenizer'])
If this works, the only problem here is that you'll also need to set disable=['tokenizer'] when you load the model back in using nlp.from_disk(). So packaging a model and loading it via spacy.load() won't work out-of-the-box.
We'll think about a good way to solve this in the future. When saving out a model, spaCy should probably check if the tokenizer is serializable and if not, show a warning, but serialize anyway.
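A minimal sketch of what such a guard inside Language.to_disk could look like (hypothetical, not actual spaCy code):

from pathlib import Path
import warnings

model_path = Path('/path/to/model')  # placeholder output directory

# hypothetical check before serializing the tokenizer
if not hasattr(nlp.tokenizer, 'to_disk'):
    warnings.warn("Tokenizer is not serializable – skipping it; "
                  "re-create it when loading the model back in")
else:
    nlp.tokenizer.to_disk(model_path / 'tokenizer')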
Nice to hear that Chinese and Thai worked well – this is really cool!
@ines
Oh yeah, it worked. I could save the Japanese NER model successfully. Thanks so much :)
Just pushed a fix to Japanese that implements "dummy" serialization methods on the tokenizer to prevent the error. I also found another small bug that caused the Japanese vocab not to set the lang correctly (meaning the saved-out model's meta.json had "lang": "" set, which causes an error when loading the model back in).
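The fix boils down to giving the tokenizer no-op serialization hooks, roughly like this (a simplified sketch of the pattern, not the exact committed code):

class JapaneseTokenizer(object):
    # ... tokenization via Janome ...

    # no-op serialization methods: there's no state to save, but
    # defining them prevents the AttributeError when spaCy calls
    # them during nlp.to_disk() / nlp.to_bytes()
    def to_bytes(self, **exclude):
        return b''

    def from_bytes(self, bytes_data, **exclude):
        return self

    def to_disk(self, path, **exclude):
        return None

    def from_disk(self, path, **exclude):
        return self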
Just tested it locally, and both to/from disk and to/from bytes now work correctly. This means you should also be able to package your Japanese model as a Python package using the spacy package command.
I have a similar problem that I could not fix: I've trained a custom NER model that I'd like to save to disk, and since I'm using a custom tokenizer, I don't want to save the tokenizer. Here's what I did:
import spacy
nlp = spacy.load("en")
nlp.tokenizer = some_custom_tokenizer
# Train the NER model...
nlp.tokenizer = None
nlp.to_disk('/tmp/my_model', disable=['tokenizer'])
(Due to this thread, I did not package the model.)
When I try to load it, the pipeline is empty, and surprisingly, it has the default spaCy tokenizer.
nlp = spacy.blank('en').from_disk('/tmp/my_model', disable=['tokenizer'])
I need to load the model without the tokenizer but with the full pipeline. Any ideas? thanks.
More about this issue: when I tried to load the model like this:
loaded_nlp = spacy.load('/model/directory', disable=['tokenizer'])
I got an error:
FileNotFoundError: [Errno 2] No such file or directory: '/model/directory/tokenizer'
I looked at the code of util.load_model_from_path and I think I found a bug there. Line 158 is:
return nlp.from_disk(model_path)
If the disable parameter were forwarded to this call, it would be possible to use spacy.load directly for loading models without specific parts:
return nlp.from_disk(model_path, disable=disable)
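With that change, loading a model without the tokenizer would work directly (hypothetical – this assumes the patch above is applied):

import spacy

# disable would now be forwarded to nlp.from_disk(), so the
# missing tokenizer directory is simply skipped
loaded_nlp = spacy.load('/model/directory', disable=['tokenizer'])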